Abstract

Virtual synchrony is an important abstraction that has proven to be
extremely useful when implemented over asynchronous, typically large,
message-passing distributed systems. Fault-tolerant design is a key
criterion for the success of such implementations. This is because large
distributed systems can be highly available as long as they do not
depend on the full operational status of every system participant. Namely,
they employ redundancy in numbers to overcome non-optimal behavior
of participants and to gain global robustness and high availability.

Self-stabilizing systems can tolerate transient faults that drive the
system to an arbitrary unpredicted configuration. Such systems
automatically regain consistency from any such arbitrary configuration, and then
produce the desired system behavior. Practically self-stabilizing systems
ensure the desired system behavior for a practically infinite number of
successive steps, e.g., $2^{64}$ steps.

We present the first practically self-stabilizing virtual synchrony
algorithm. The algorithm is a combination of several new techniques that may
be of independent interest. In particular, we present a new counter
algorithm that establishes an efficient practically unbounded counter, which
in turn can be directly used to implement a self-stabilizing
Multiple-Writer Multiple-Reader (MWMR) register emulation. Other components
include self-stabilizing group membership, self-stabilizing multicast, and
self-stabilizing emulation of a replicated state machine. As we base the
replicated state machine implementation on virtual synchrony, rather than
consensus, the system progresses in more extreme asynchronous
executions than a consensus-based replicated state machine does.
1 Introduction


Virtual Synchrony (VS) has proven to be very important in the scope
of fault-tolerant distributed systems [5]. The VS property ensures that two
or more processors that participate in two consecutive communicating groups
have delivered the same messages. Systems that support the VS
abstraction are designed to operate in the presence of fail-stop failures of a minority
of the participants. Such a design fits large computer clusters, datacenters
and cloud computing, where at any given time some of the processing units
are non-operational. Systems that cannot tolerate such failures degrade in
functionality and availability to the point of becoming unusable.

Group communication systems that realize the VS abstraction provide
services such as group membership and reliable group multicast. The group
membership service is responsible for providing the current group view of the recently
live and connected group members, i.e., a processor set and a unique view
identifier, which is a sequence number of the view installation. The reliable group
multicast allows the service clients to exchange messages with the group
members as if the group were a single communication endpoint with a single network address,
to which messages are delivered in an atomic fashion. Thus any message is
either delivered to all recently live and connected group members prior to the
next message, or is not delivered to any member. The challenges related to
VS consist of the need to maintain atomic message delivery in the presence of
asynchrony and crash failures. VS facilitates the implementation of a replicated
state machine [5] that is more efficient than classical consensus-based
implementations, which start every multicast round with an agreement on the set of recently
live and connected processors. It is also usually easier to implement [5]. To the
best of our knowledge, no self-stabilizing virtual synchrony solution exists.

Transient violations of design assumptions can lead a system to an arbitrary 
state. For example, the assumption that error detection ensures the arrival of 
correct messages and the discarding of corrupted messages, might be violated 
since error detection is a probabilistic mechanism that may not detect a corrupt 
message. As a result, the message can be regarded as legitimate, driving the 
system to an arbitrary state after which, availability and functionality may 
be damaged forever, requiring human intervention. In the presence of transient 
faults, large multicomputer systems providing VS-based services can prove hard 
to manage and control. One key problem, not restricted to virtually synchronous 
systems, is catering for counters (such as view identifiers) reaching an arbitrary 
value. How can we deal with the fact that transient faults may force counters 
to wrap around to the zero value and violate important system assumptions 
and correctness invariants, such as the ordering of events? A self-stabilizing 
algorithm [10] can automatically recover from such unexpected failures, possibly 
as part of after-disaster recovery or even after benign temporal violations of the 
assumptions made in the design of the system. We tackle this issue in our work. 

Contributions. We present the first self-stabilizing virtual synchrony
solution. Specifically:

• We provide a self-stabilizing counter algorithm that uses bounded memory
and communication bandwidth, and yet allows many writers to increment the
counter an unbounded number of times in the presence of processor
crashes and unbounded communication delays.

• Our counter algorithm is modular with a simple interface for increasing 
and reading the counter, as well as providing the identifier of the processor 
that has incremented it. 

• At the heart of our counter algorithm is the underlying labeling algorithm
that extends the label scheme of Alon et al. [1] to support multiple
writers. The algorithm specifies how the processors exchange their label
information in the asynchronous system and how they maintain proper
label bookkeeping so as to “discover” the greatest label and discard all
obsolete ones.

• An immediate application of our counter algorithm is a self-stabilizing 
MWMR register emulation. 

• The self-stabilizing counter algorithm, together with the proposed
implementations of a self-stabilizing reliable multicast service and membership
service, are composed to yield a self-stabilizing VS-based State Machine
Replication (SMR) implementation.

Related Work. Leslie Lamport was the first to introduce SMR, presenting it 
as an example in [17]. Schneider [20] gave a more generalized approach to the 
design and implementation of SMR protocols. Group communication services 
can implement SMR by providing reliable multicast that guarantees VS [4]. 
Birman et al. were the first to present VS and a series of improvements in 
the efficiency of ordering protocols [6]. Birman gives a concise account of the 
evolution of the VS model for SMR in [5]. 

Research during recent decades has resulted in an extensive literature on
ways to implement VS and SMR, as well as industrial construction of such
systems. A recent research line on (practically) self-stabilizing versions of replicated
state machines [1, 9, 13, 14] obtains self-stabilizing replicated state machines in
shared memory as well as in synchronous and asynchronous message passing
systems.

The bounded labeling scheme and the use of practically unbounded sequence
numbers proposed in [1] allow the creation of self-stabilizing bounded-size
solutions to the never-exhausted counter problem in the restricted case of a single
writer. In [9] a self-stabilizing version of Paxos was developed that led to a
self-stabilizing consensus-based SMR implementation. To this end, the authors of [9] extend
the labeling scheme of [1] to allow for multiple counter writers, since unbounded
counters are required for ballot numbers. Extracting this scheme for other uses
does not seem straightforward. We present a simpler and significantly more
communication-efficient self-stabilizing (bounded-size never-exhausted) counter that also
supports many writers, where a single label rather than a vector of labels needs
to be communicated. Our solution is highly modular and can be easily used in
any similar setting requiring such counters.
Practically-stabilizing VS and self-stabilizing VS are identical when VS is
defined by the behaviour of classical VS algorithms that use (bounded)
counters. These algorithms preserve the VS requirements as long as the counters
do not reach their upper bound. In our setting, if a counter reaches the upper
bound due to a transient fault, our self-stabilizing/practically-stabilizing
solution introduces a new epoch with new sequence numbers. It thus converges
to behave, for the same number of steps, exactly as a properly
initialized non-stabilizing VS algorithm.

Next, in Section 2, we overview our construction, describing the core
techniques and the way they establish the desired properties. In Section 3 we present
the model of computation we consider. Section 4 details the self-stabilizing
Labeling and Increment Counter algorithms. In Section 5 we detail the
self-stabilizing Virtual Synchrony algorithm and the resulting replicated state
machine emulation. We conclude in Section 6.


2 Our Results in a Nutshell 

We start with the necessary succinct description of the system settings (more
details in Section 3). We consider an asynchronous message passing system
consisting of n communicating processors, each with a unique identifier. We assume
that up to a minority of the processors might become inactive. The
communication network topology is that of a fully connected graph. Any message that is sent
infinitely often from one active processor to another active processor is eventually
received. We often use the term packets for low-level messages, distinguishing
packets that are retransmitted to ensure delivery of high-level messages exactly
once. Moreover, we assume that the communication links have known bounded
capacity, and thus we can use existing self-stabilizing data-link layer algorithms
for emulating reliable FIFO communication channel protocols that can even
tolerate message omission, duplication as well as transient faults [11, 12].

2.1 Bounded labeling scheme for multiple writers 

As mentioned, Alon et al. [1] presented a bounded labeling scheme to implement 
an SWMR register emulation in a message-passing system. The labels (also
called epochs) allow the system to stabilize, since once a label is established,
the integer counter related to this label is considered to be practically infinite, 
as a 64-bit integer is practically infinite and sufficient for the lifespan of any 
reasonable system. We extend the labeling scheme of [1] to support multiple 
writers, by including the epoch creator (writer) identity to break symmetry, 
and decide which epoch is the most recent one, even when two or more creators 
concurrently create a new label. 

When all processors (and hence potential writers) are active, the scheme
can be viewed as a simple extension of the one of [1]. Informally speaking,
the scheme assures that each processor $p_i$ eventually “cleans up” the system
from obsolete labels of which $p_i$ appears to be the creator (for example, such
labels could be present in the system’s initial arbitrary state). Specifically, $p_i$
maintains a bounded FIFO history of such labels that it has recently learned
while communicating with the other processors, and creates a label greater than
all that are in its history; call this $p_i$’s local maximal label. In addition, each
processor seeks to learn the globally maximal label, that is, the label in the system
that is the greatest among the local maximal ones. Unfortunately, when some
processors are not active, finding a global maximal label becomes challenging, since
these processors will not “clean up” their local labels. So, roughly speaking, the
active processors need to do this indirectly without knowing which processors
are inactive. To overcome this problem, we have each processor maintain
bounded FIFO histories of labels appearing to have been created by other
processors. These histories eventually accumulate the obsolete labels of the
inactive processors. We show that even in the presence of (a minority of) inactive
processors, starting from an arbitrary state, the system eventually converges to
use a global maximal label.

Let us explain why obsolete labels from inactive processors can create a
problem when no one ever cleans them up (cancels them). Consider a system starting
in a state that includes a cycle of labels $\ell_1 \prec \ell_2 \prec \ell_3 \prec \ell_1$, all of the same
creator, say $p_x$, where $\prec$ is the label order relation. If $p_x$ is active, it will
eventually learn about these labels and introduce a label greater than them all.
But if $p_x$ is inactive, the system’s asynchronous nature may cause a repeated
cyclic label adoption, especially when $p_x$ has the greatest processor identifier,
as these identifiers are used to break symmetry. Say that an active processor
learns and adopts $\ell_1$ as its global maximal label. Then, it learns about $\ell_2$ and
hence adopts it, while forgetting about $\ell_1$. Then, learning of $\ell_3$, it adopts it.
Lastly, it learns about $\ell_1$, and as it is greater than $\ell_3$, it adopts $\ell_1$ once more as
the greatest in the system; this can continue indefinitely. By using the bounded
FIFO histories, such labels will be accumulated in the histories and hence will
not be adopted again, ending this vicious cycle.
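
The effect of the histories on such a cycle can be illustrated with a toy model. The sketch below is not the labeling algorithm itself: the label names, the explicit `PRECEDES` relation and the `count_adoptions` helper are assumptions made only for illustration, and the real labels and bookkeeping are defined in Section 4. It merely contrasts an active processor that remembers previously adopted labels with one that does not.

```python
from collections import deque

# Hypothetical cyclic label order: l1 < l2 < l3 < l1 (deliberately not transitive).
PRECEDES = {("l1", "l2"), ("l2", "l3"), ("l3", "l1")}


def precedes(a, b):
    """True if label a is smaller than label b in the toy order."""
    return (a, b) in PRECEDES


def count_adoptions(stream, history_size):
    """Count how often an active processor changes its maximal label."""
    history = deque(maxlen=history_size)  # bounded FIFO of labels adopted before
    current, adoptions = None, 0
    for label in stream:
        if label == current or label in history:
            continue  # already known: it is never adopted again
        if current is None or precedes(current, label):
            if current is not None:
                history.append(current)  # remember the label being replaced
            current, adoptions = label, adoptions + 1
    return adoptions


# The inactive creator's three labels keep arriving in the same order.
stream = ["l1", "l2", "l3"] * 3
```

With `history_size=0` every arriving label is adopted again and again, whereas a non-empty history stops the adoptions after the first pass over the cycle.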

2.2 Practically infinite counter for multiple writers 

Using our labeling scheme, we show how to implement a practically infinite
counter supporting multiple writers. The idea is to extend the labeling scheme
to handle counters, where a counter consists of a label, as used in the labeling
scheme; an integer sequence number, ranging from $0$ to $2^b$, where $b$ is large
enough, say $b = 64$; and a processor id. Conceptually, if the system stabilizes
to use a global maximal label, then the pair of the sequence number and the
processor id (of this sequence number) can be used as an unbounded counter,
as used, for example, in MWMR register implementations [18, 19]. Specifically,
we say that counter $cnt_1 = (\ell_1, seqn_1, wid_1)$ is smaller than counter
$cnt_2 = (\ell_2, seqn_2, wid_2)$ if $(\ell_1 \prec \ell_2)$ or $((\ell_1 = \ell_2)$ and $(seqn_1 < seqn_2))$ or
$((\ell_1 = \ell_2)$ and $(seqn_1 = seqn_2)$ and $(wid_1 < wid_2))$. Note that when processors
have the same label, the above relation forms a total ordering and processors
can increment a shared counter even when attempting to do so concurrently.
We argue that starting from any initial configuration, eventually the counter
algorithm supports such increments.
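
The counter order above can be written down as a short sketch. The field names follow the text, but the integer `label` field and the `label_precedes` helper are assumptions standing in for the bounded epoch labels and their order relation, which the labeling scheme defines differently (and only as a partial order).

```python
from typing import NamedTuple


class Counter(NamedTuple):
    label: int  # assumption: an integer stands in for an epoch label
    seqn: int   # sequence number within the epoch
    wid: int    # identifier of the processor that wrote seqn


def label_precedes(a: int, b: int) -> bool:
    """Hypothetical stand-in for the label order relation."""
    return a < b


def counter_less(c1: Counter, c2: Counter) -> bool:
    """True if c1 is smaller than c2 according to the three-way rule."""
    if label_precedes(c1.label, c2.label):
        return True
    if c1.label != c2.label:
        return False  # different labels with c1's label not below c2's label
    # Same label: order by sequence number, then by writer identifier.
    return (c1.seqn, c1.wid) < (c2.seqn, c2.wid)
```

For counters that share a label, the rule reduces to a lexicographic comparison of the pair (sequence number, writer identifier), which is why concurrent increments by different writers remain totally ordered.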

The counter increment algorithm uses the same structures and procedures
as the labeling algorithm, but now with counters instead of labels. To increment
the counter, a processor $p_i$ first sends a request to all other processors querying
the counter they consider as the global maximum and awaits responses from
a majority. Using a procedure similar to that of the labeling algorithm, it (eventually)
finds the maximal epoch label and the maximal sequence number it knows for
this label. In other words, it collects counters and finds the counter(s) with
the largest global label; there can be more than one such counter, in which
case it returns the one with the highest sequence number, breaking symmetry
with the identifiers of the sequence number writers. Then it checks whether this
maximal sequence number is exhausted, that is, whether the sequence number is equal to
or larger than $2^{64}$ (this could be, for example, due to the arbitrary values in
the configuration the system starts in). When this is the case, it proceeds to
find a new maximal label until it finds one that is not exhausted and uses the
maximal sequence number it knows for this epoch label. Then the processor
increments the sequence number by one, sets its identifier as the writer of the
sequence number, sends the new counter to all processors, and awaits
acknowledgments from a majority (this is, in spirit, similar to the two-phase
write operation of MWMR register implementations, focusing on the sequence
number rather than on an associated value).
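
The two phases described above can be sketched as follows. This is a minimal, single-process illustration under strong assumptions: integer labels replace the bounded epoch labels, `best.label + 1` stands in for obtaining a fresh label from the labeling algorithm, and the `replicas` list and `reachable` argument are hypothetical substitutes for message exchange with a majority.

```python
from typing import List, NamedTuple

MAX_SEQN = 2**64  # a sequence number at or above this value is exhausted


class Counter(NamedTuple):
    label: int  # assumption: integer stand-in for an epoch label
    seqn: int
    wid: int


def increment(replicas: List[Counter], writer: int, reachable: List[int]) -> Counter:
    """Two-phase increment by `writer`, talking only to `reachable` replicas."""
    if len(set(reachable)) <= len(replicas) // 2:
        raise RuntimeError("no majority reachable")
    # Phase 1: query a majority for the counter each considers maximal.
    best = max(replicas[j] for j in reachable)  # lexicographic tuple order
    if best.seqn >= MAX_SEQN:
        # Exhausted epoch: stand-in for asking the labeling algorithm for a
        # new maximal label, whose counter starts at sequence number 0.
        best = Counter(best.label + 1, 0, writer)
    new = Counter(best.label, best.seqn + 1, writer)
    # Phase 2: write the new counter to a majority and collect acknowledgments.
    for j in reachable:
        if replicas[j] < new:
            replicas[j] = new
    return new
```

Because any two majorities intersect, a later increment that queries a different majority still observes the result of the earlier one.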

Note that when a processor $p_i$ establishes a new label $\ell$ as the global
maximum, it sets the corresponding counter $cnt = (\ell, 0, i)$; in this case, the label
creator identifier and the sequence number writer identifier are both $i$. When there is
an already established maximal label $\ell$ in the system and processor $p_i$ wants to
increment the counter, it increases the corresponding (to $\ell$) maximal sequence
number found ($maxseqn$) by one, and sets the counter $cnt = (\ell, maxseqn + 1, i)$;
in this case, it is possible that the label creator identifier and the sequence
number writer identifier are not the same, namely when $p_i$ was not the creator of label $\ell$.
Also, note that some extra care is needed with respect to counter bookkeeping
so as not to increase the size of the bounded histories used in the labeling
algorithm. Having a counter increment algorithm, it is not difficult to obtain a
practically self-stabilizing MWMR register implementation; counters are
associated with values and the counter increment algorithm is run with this small
amendment (more details in Sect. 4.3).

2.3 Practically self-stabilizing virtual synchrony and replicated state machine

Our self-stabilizing Virtual Synchrony implementation combines the
implementation of our counter algorithm and a self-stabilizing FIFO data link between
any two participants; the latter is used to implement a self-stabilizing reliable
multicast service and a self-stabilizing failure detector (used for the membership
service).
Data link implementation. One version of a self-stabilizing FIFO data link
implementation that we can use is based on the fact that communication links
have bounded capacity. Packets are retransmitted until the number of
acknowledgments received exceeds the total capacity, while acknowledgments are sent only when a
packet arrives (not spontaneously) [11, 12]. Over this data link, the two
connected processors can constantly exchange a “token”. Specifically, the sender
(possibly the processor with the highest identifier among the two) constantly
sends packet $\pi_1$ until it receives enough acknowledgments (more than the
capacity). Then, it constantly sends packet $\pi_2$, and so on and so forth. This
assures that the receiver has received packet $\pi_1$ before the sender starts sending
packet $\pi_2$. This can be viewed as a token exchange. We use the abstraction
of the token carrying messages back and forth between any two communication
entities. We use this token exchange technique when implementing a reliable
multicast procedure, as well as the basis of a heartbeat for detecting whether
a processor is active or not; when a processor is no longer active, the token will
not be returned to the other processor.
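
The reason for waiting for more acknowledgments than the capacity can be seen in a small simulation. The sketch below is an assumption-laden illustration rather than the data-link algorithm of [11, 12]: `CAPACITY`, the `transmit` helper and the round-based loss model are all hypothetical. The acknowledgment channel may start with up to `CAPACITY` stale acknowledgments, as after a transient fault, and the packet itself is lost for the first `lost_rounds` rounds.

```python
from collections import deque

CAPACITY = 3  # assumed bound on packets in transit in one direction


def transmit(label, stale_acks, lost_rounds):
    """Resend `label` until more than CAPACITY matching acks arrive.

    Returns (rounds used, whether the receiver really got the packet).
    """
    assert len(stale_acks) <= CAPACITY
    ack_channel = deque(stale_acks)  # arbitrary initial content
    acks, rounds, delivered = 0, 0, False
    while acks <= CAPACITY:
        if rounds >= lost_rounds:            # the packet survives this round
            delivered = True
            if len(ack_channel) < CAPACITY:  # a full channel omits the new ack
                ack_channel.append(label)    # acks are sent only on arrival
        rounds += 1
        if ack_channel and ack_channel.popleft() == label:
            acks += 1
    return rounds, delivered
```

Since at most `CAPACITY` acknowledgments can be stale, the sender leaves the loop only after at least one genuine acknowledgment, so `delivered` is true whenever `transmit` returns; a sender that stopped at the first acknowledgment could be fooled by a stale one.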

Reliable multicast implementation. As we will see next, we use a
coordinator to exchange messages (by multicasting) within the group. The
coordinator requests, collects and combines input from the group members, and then it
multicasts the updated information. Specifically, when the coordinator decides
to collect inputs, it waits for the token to arrive from each group participant.
Whenever a token arrives from a participant, the coordinator uses the token to
send the request for input to that participant, and waits for the token to return
with some input (possibly $\bot$, when the participant does not have a new input).
Once the coordinator receives an input from a certain participant with respect
to this multicast invocation, the corresponding token will not carry any new
requests to receive input from the same participant; of course, the tokens continue
to move back and forth. When all inputs are received, the coordinator combines
them and again uses the token to carry the updated information. Once this is
done, the coordinator can proceed to the next input collection, when needed.
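
One input-collection round of the coordinator can be sketched as follows. The function name `collect_inputs`, the `token_arrivals` sequence and the use of `None` for the empty input are assumptions made for illustration; the actual procedure runs over the token-carrying data links and is detailed in Section 5.

```python
BOTTOM = None  # stand-in for the "no new input" value


def collect_inputs(members, token_arrivals, pending_inputs):
    """Simulate one collection round driven by token arrivals.

    token_arrivals lists, in order, the member whose token reaches the
    coordinator. Returns the combined inputs (ordered by member identifier)
    once every member has answered, or None if the round is incomplete.
    """
    requested, collected = set(), {}
    for m in token_arrivals:
        if m in collected:
            continue  # token keeps circulating but carries no new request
        if m not in requested:
            requested.add(m)  # this token carries the request to m
        else:
            # The token returned from m with its input (possibly BOTTOM).
            collected[m] = pending_inputs.get(m, BOTTOM)
        if len(collected) == len(members):
            return [collected[k] for k in sorted(members)]
    return None
```

Each member therefore needs two token arrivals per round: one to carry the request and one to bring the input back.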

Failure detector implementation. Every processor p maintains a heartbeat
integer counter for every other processor q. Whenever processor p receives the
token from processor q over their data link, processor p resets the value of q’s
counter to zero and increments all the integer counters associated with the other
processors by one, up to a predefined threshold value W. Once the heartbeat
counter value of a processor q reaches W, the failure detector of processor p
considers q as inactive. In other words, the failure detector at processor p
considers processor q to be active if and only if the heartbeat associated with q
is strictly less than W. This is essentially the failure detector mentioned in [9].
Note that for the correctness of our virtual synchrony algorithm, we require a
weaker failure detector. Specifically, we only require that, for a sufficiently long
time, there is at least one processor that a majority of the processors does not suspect,
as opposed to an eventually perfect failure detector that ensures that after a
certain time, no active processor suspects any other active processor.
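
The heartbeat rule above translates almost directly into code. The following is a minimal sketch; the class and method names are assumptions, and token arrival is modelled as a plain method call rather than a data-link event.

```python
class FailureDetector:
    """Heartbeat counters kept by one processor for all other processors."""

    def __init__(self, self_id, ids, threshold):
        self.W = threshold  # the predefined threshold value W
        self.heartbeat = {q: 0 for q in ids if q != self_id}

    def token_received(self, q):
        """Reset q's counter and age every other counter, capped at W."""
        for r in self.heartbeat:
            if r == q:
                self.heartbeat[r] = 0
            else:
                self.heartbeat[r] = min(self.heartbeat[r] + 1, self.W)

    def active(self):
        """Processors whose heartbeat is strictly less than W."""
        return {q for q, h in self.heartbeat.items() if h < self.W}
```

A processor that stops returning its token is aged by the tokens of the others and is suspected once its counter reaches W; if it later resumes, a single token arrival resets its counter and it is considered active again.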

Self-stabilizing virtual synchrony implementation. The algorithm is
coordinator-based and we consider a primary-group implementation [6]. To
assign view identifiers, we use our counter increment algorithm. Specifically, the
view identifier is a triple that includes an epoch (label), the currently highest
counter, cnt, which the counter algorithm obtains, and the processor that has
created this counter, cnt.wid (writer), which is also the view coordinator. Note
that this defines a simple interface with the counter algorithm, which provides
an identical output. Furthermore, the view membership uses the output of the
coordinator’s failure detector for defining the set of view members; this helps to
maintain a consistent membership among the group members, despite
inaccuracies between the various failure detectors. As we show, this does not break the
virtual synchrony property, as long as the majority-based failure detector
property is preserved. Recall that the coordinator is responsible for the consistency
of the multicast mechanism within the group.

It may happen that the system reaches a configuration with no coordinator. 
For example, this could be the case in the arbitrary configuration that the 
system starts in, or in the case that the coordinator of an installed view is no 
longer active. Each participant that detects that it has no coordinator seeks
potential candidates based on the exchanged information. A processor p regards
a processor q as a candidate, if q is active according to p’s failure detector, and 
there is a majority of processors that also think so (all these are based on p’s 
knowledge, which due to asynchrony might not be up to date). When there is 
more than one such candidate, processor p checks whether there is a candidate 
that has proposed a higher counter among the candidates. If there is one, then 
p considers it to be the coordinator and waits to hear from it (or learn that it 
is not active). If there is none, and based on its knowledge there is a majority 
of processors that also do not have a coordinator, then processor p acquires a 
counter from the counter increment algorithm and proposes a new view, whose
view ID is the counter and whose group membership is the set of processors that appear
active according to its failure detector. As we show, if p receives an “accept”
message from all the processors in the view, then it proceeds to install the 
view, unless another processor who has obtained a higher counter does so. In a 
transition from one view to the next, there can be several processors attempting 
to become the coordinator (namely, those who according to their knowledge 
have a supporting majority). Still, by exploiting the intersection property of 
the supporting majorities we prove that each of these processors will propose a 
view at most once, and out of these, one view will be installed (i.e., we do not 
have never-ending attempts for new views to be installed). 

The virtual synchrony property essentially requires that any two processors
that participate in two consecutive groups should have delivered the same
messages. Roughly speaking, our algorithm preserves this property as follows: Once
a processor does not have a coordinator, it stops participating in group
multicasting, and prior to delivering a new multicast message in a new view, the
algorithm assures that the coordinator of this new view has collected all the
participants’ last delivered messages (in their prior views) and resends the
messages that appear not to have been delivered uniformly. To do so, each participant
keeps the last delivered message and the identifier of the view in which this message
was delivered. We show that this, together with the intersection property of majorities
(and after taking care of some subtle issues), provides the virtual synchrony
property. Starting from an arbitrary configuration, we show that if there is no
valid coordinator, eventually a processor proposes a new view and, therefore, a
valid coordinator is eventually elected. To assure this, processors continuously
exchange, through the failure detector’s token, their coordinator’s identifier (or
$\bot$ if there is none). This helps to detect initially corrupted states in which, say,
a processor $p_i$ considers $p_j$ as its coordinator, but $p_j$ does not consider
itself to be the coordinator. Combining the above with the self-stabilization of
the counter increment algorithm, the data links, the failure detector and
multicast, we are able to guarantee reaching a legal execution in which the virtual
synchrony property is always satisfied.

Self-stabilizing replicated state machine implementation. Each
participant maintains a replica of the state machine and the last processed
(composite) message. Note that we bound the memory used to store the history of
the replicated state machine by keeping only its current state, which
encapsulates the influence of that history.
In addition, each participant maintains the last delivered (composite) message
to ensure common reliable multicast, as the coordinator may stop being active
prior to ensuring that all members received a copy of the last multicast message.
Whenever a new coordinator is elected, the coordinator queries all members
(forming a majority) for the most updated state and delivered message. Since
at least one of the members, say $p_i$, participated in the group in which the last
completed state machine transition took place, $p_i$’s information will be
recognized as associated with the largest counter and adopted by the coordinator, which
will in turn assign the most updated state and available delivered message to
all the current group members, in essence satisfying the virtual synchrony
property. Then the coordinator, as part of the multicast procedure, collects inputs
received from the environment before ensuring that all group members apply
these inputs to the replica state machine. Note that the received multicast
message consists of an input (possibly $\bot$) from each of the processors. Since the
processors need to apply one input at a time, they may apply the inputs
in an agreed upon sequential order, say from the input of the first processor to
that of the last. Alternatively, the coordinator may request one input at a time in a
round-robin fashion and multicast it. Finally, to ensure that the system
stabilizes when started in an arbitrary configuration, every so often, the coordinator
assigns the state of its replica to the other members.
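
Applying a composite message in an agreed order can be sketched as below. The helper name `apply_composite`, the dictionary representation of a composite message and the use of `None` for the empty input are assumptions for illustration; the ordering by processor identifier is one possible agreed upon order, as suggested in the text.

```python
BOTTOM = None  # stand-in for "no input from this processor"


def apply_composite(state, composite, transition):
    """Apply the inputs of one composite multicast message to a replica.

    composite maps processor identifiers to inputs; inputs are applied one
    at a time in increasing identifier order, skipping BOTTOM entries.
    """
    for pid in sorted(composite):
        value = composite[pid]
        if value is not BOTTOM:
            state = transition(state, value)
    return state
```

Because every replica sorts by the same key, replicas that receive the same composite message reach the same state even when the transition function is not commutative.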

Some of the above ideas may appear conceptually clear; however, there
are critical low-level details that are essential to realizing them and proving them
correct, as we are now ready to describe.



3 System Settings 

We consider an asynchronous message passing system as the one used in [1].
The system includes a set P of n communicating processors; we refer to the
processor with identifier $i$ as $p_i$. We assume that up to a minority of processors
may become inactive. We assume that the system runs on top of a
stabilizing data-link layer that provides reliable FIFO communication over unreliable
bounded capacity channels [11, 12]. The network topology is that of a fully
connected graph where every two processors exchange (low-level messages called)
packets to enable a reliable delivery of (high-level) messages. When no
confusion is possible we use the term messages for packets. The communication links
have bounded capacity, so that the number of packets in a link at any given instant is
bounded by a constant. When processor $p_i$ sends a packet, pkt, to processor $p_j$,
the operation send inserts a copy of pkt into the FIFO queue that represents the
communication channel from $p_i$ to $p_j$, while respecting an upper bound on the
number of packets in the channel, possibly omitting the new packet or one of the
already sent packets. When $p_j$ receives pkt from $p_i$, pkt is dequeued from the
queue representing the channel. We assume that packets can be spontaneously
omitted (lost) from the channel; however, a packet that is sent infinitely often
is received infinitely often.

The code of self-stabilizing algorithms usually consists of a do forever loop 
that contains communication operations with the neighbors and validation that 
the system is in a consistent state as part of the transition decision. An iteration 
is said to be complete if it starts in the loop’s first line and ends at the last 
(regardless of whether it enters branches). 

Every processor, $p_i$, executes a program that is a sequence of (atomic) steps,
where a step starts with local computations and ends with a single
communication operation, which is either send or receive of a packet. For ease of
description, we assume the interleaving model, where steps are executed
atomically, a single step at any given time. An input event can be either the receipt
of a packet or a periodic timer triggering $p_i$ to (re)send. Note that the system
is asynchronous and the rate of the timer is totally unknown.

The state, $s_i$, of a node $p_i$ consists of the value of all the variables of the node
including the set of all incoming communication channels. The execution of an
algorithm step can change the node’s state. The term (system) configuration
is used for a tuple of the form $(s_1, s_2, \ldots, s_n)$, where each $s_i$ is the state of
node $p_i$ (including messages in transit for $p_i$). We define an execution (or run)
$R = c_0, a_0, c_1, a_1, \ldots$ as an alternating sequence of system configurations $c_x$ and
steps $a_x$, such that each configuration $c_{x+1}$, except the initial configuration $c_0$,
is obtained from the preceding configuration $c_x$ by the execution of the step $a_x$.
A practically infinite execution is an execution with many steps (and iterations),
where many is defined to be proportional to the time it takes to execute a step
and the life-span time of a system.

We define the system’s task by a set of executions called legal executions
(LE) in which the task’s requirements hold. We use the term safe configuration
for any configuration in LE. An algorithm is self-stabilizing with relation to
the task LE when every (unbounded) execution of the algorithm reaches a safe
configuration with relation to the algorithm and the task. An algorithm is
practically stabilizing with relation to the task LE if in any practically infinite
execution a safe configuration is reached.

[Figure 1 view labels: $v_1 = \{p_1, p_4, p_5\}$, $v_2 = \{p_1, p_2, p_3, p_4\}$, $v_3 = \{p_1, p_2, p_3, p_4\}$, $v_4 = \{p_2, p_3, p_4\}$]

Figure 1: An execution satisfying the VS property. The grey boxes indicate a new
view installation, and the example shows four views. View $v_1$ initially has
membership $\{p_1, p_4, p_5\}$. The reliable multicast reaches all members of the group. Two new
processors $p_2$ and $p_3$ join the group, forming view $v_2$. In this view, $p_5$ crashes before
completing its multicast, which is ignored (dashed lines). The new view $v_3$ is formed to
exclude $p_5$, and in it, $p_1$ manages a successful multicast before crashing. The multicast
of $p_3$ is reliable and guaranteed to be delivered to all non-crashed processors within the view, that
is excluding $p_3$ which might or might not have received it (dotted line). A new view
is then formed to capture the failure of $p_1$.

The virtual synchrony task requires that any two processors that share the
same sequence of views deliver identical message sets in these
views. The legal execution of virtual synchrony is defined in terms of the input
and output sequences of the system with the environment. When a majority of
processors are continuously active, every external input (and only the external
inputs) should be atomically accepted and processed by the majority of the
active processors. Note that there is no delivery and processing guarantee in
executions in which there is no majority; still, in these executions any delivery
and processing is due to a received environment input. An exemplar virtually
synchronous execution can be found in Figure 1.

Notation. Throughout the paper we use the following notation. Let $y$ and $y'$
be two objects that both include the field $x$. We denote
$(y =_x y') \equiv (y.x = y'.x)$.

4 Self-stabilizing Labeling Scheme and Counter 
Algorithm 

In this section, we first present the proposed self-stabilizing labeling
algorithm and prove it correct, and then explain in Section 4.3 how it can be
extended to implement self-stabilizing practically unbounded counters.
Algorithm 1: The nextLabel() function; code for $p_i$

1 For any non-empty set $X \subseteq D$, function $pick(d, X)$ returns $d$ arbitrary elements
  of $X$;

input : $S = \langle \ell_1, \ell_2, \ldots, \ell_k \rangle$ set of $k$ labels,
output : $\langle i, newSting, newAntistings \rangle$

2 let $newAntistings = \{\ell_j.sting : \ell_j \in S\}$;

3 $newAntistings \leftarrow newAntistings \cup pick(k - |newAntistings|, D \setminus newAntistings)$;

4 return $\langle i, pick(1, D \setminus (newAntistings \cup \{\bigcup_{\ell_j \in S} \ell_j.Antistings\})), newAntistings \rangle$;


4.1 Labeling Algorithm for Concurrent Label Creations 

4.1.1 Bounded Labeling Scheme 

We extend the labeling scheme of [1] to support wait-free multi-writer systems. 
We do so, by extending the label with a label creator’s identifier, so as to break 
symmetry and decide about the most recent epoch even when two or more 
writers concurrently attempt to create a new label. 

Specifically, we consider the set of integers $D = [1, k^2 + 1]$. A label (or
epoch) is a triple $\langle lCreator, sting, Antistings \rangle$, where $lCreator$ is the identity
of the processor that established (created) the label, $Antistings \subset D$ with
$|Antistings| = k$, and $sting \in D$. Given two labels $\ell_i, \ell_j$, we define the relation
$\ell_i \prec_{lb} \ell_j \equiv (\ell_i.lCreator < \ell_j.lCreator) \lor (\ell_i.lCreator = \ell_j.lCreator \land
((\ell_i.sting \in \ell_j.Antistings) \land (\ell_j.sting \notin \ell_i.Antistings)))$; we use $=_{lb}$ to say
that the labels are identical. Note that the relation $\prec_{lb}$ does not define a total
order. For example, when $\ell_i =_{lCreator} \ell_j$ and $(\ell_i.sting \notin \ell_j.Antistings)$ and
$(\ell_j.sting \notin \ell_i.Antistings)$ these labels are incomparable. As in [1], we demonstrate
that one can still use this labeling scheme as long as it is ensured that
eventually a label greater than all other labels in the system is introduced. We
say that a label $\ell$ cancels another label $\ell'$, either if they are incomparable or
they have the same $lCreator$ but $\ell$ is greater than $\ell'$ (with respect to $sting$ and
$Antistings$). A label with creator $p_i$ is said to belong to $p_i$'s domain.
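
The relation $\prec_{lb}$ and the notion of canceling can be made concrete with a small sketch. The Python names below (`Label`, `precedes`, `incomparable`, `cancels`) are illustrative choices and not part of the paper; the example uses $k = 3$, so that $D = [1, 10]$.

```python
from typing import FrozenSet, NamedTuple


class Label(NamedTuple):
    creator: int                  # lCreator: identity of the creating processor
    sting: int                    # an element of D
    antistings: FrozenSet[int]    # a k-element subset of D


def precedes(a: Label, b: Label) -> bool:
    """a precedes b: smaller creator, or same creator and the sting test holds."""
    if a.creator != b.creator:
        return a.creator < b.creator
    return a.sting in b.antistings and b.sting not in a.antistings


def incomparable(a: Label, b: Label) -> bool:
    """Distinct labels that are ordered in neither direction."""
    return a != b and not precedes(a, b) and not precedes(b, a)


def cancels(a: Label, b: Label) -> bool:
    """a cancels b: incomparable, or same creator with a greater than b."""
    return incomparable(a, b) or (a.creator == b.creator and precedes(b, a))


if __name__ == "__main__":
    x = Label(1, 1, frozenset({2, 3, 4}))
    y = Label(1, 5, frozenset({1, 6, 7}))    # x precedes y
    z = Label(1, 8, frozenset({2, 9, 10}))   # incomparable with x
    print(precedes(x, y), incomparable(x, z), cancels(y, x))
```

Because the order is only partial, `incomparable` can hold for two labels of the same creator, which is exactly the situation the canceling mechanism is meant to resolve.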

Function nextLabel(), Algorithm 1, gets a set of at most $k$ labels as input
and returns a new label that is greater than all of the labels of the input. It has
the same functionality as the function called $Next_b()$ in [1], but it additionally
considers the label creator. The function essentially composes a new $Antistings$
set from the stings of all the labels it has as input, and chooses a sting that
is in none of the $Antistings$ of the input labels. In this way it ensures that
the new label is greater than any of the input. Note that the function takes
$k$ $Antistings$ of $k$ labels that are not necessarily distinct, implying at most $k^2$
distinct integers, and thus the choice of $|D| = k^2 + 1$ guarantees that an unused
integer is always available as the sting.
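
A minimal Python sketch of this construction follows. It is an illustration under stated assumptions rather than the paper's implementation: the names `Label`, `precedes` and `next_label` are hypothetical, `pick` is replaced by a deterministic "smallest elements" choice, and the sting is taken outside the new $Antistings$ set when possible, falling back to the weaker requirement described in the prose (outside all input $Antistings$) otherwise.

```python
from typing import FrozenSet, Iterable, NamedTuple


class Label(NamedTuple):
    creator: int
    sting: int
    antistings: FrozenSet[int]


def precedes(a: Label, b: Label) -> bool:
    """The label order: creator first, then the sting/Antistings test."""
    if a.creator != b.creator:
        return a.creator < b.creator
    return a.sting in b.antistings and b.sting not in a.antistings


def next_label(creator: int, labels: Iterable[Label], k: int) -> Label:
    """Return a label of `creator` that is greater than every input label."""
    labels = list(labels)
    if len(labels) > k:
        raise ValueError("at most k labels are expected")
    domain = set(range(1, k * k + 2))            # D = [1, k^2 + 1]
    new_anti = {l.sting for l in labels}         # stings of all input labels
    # Pad the new Antistings set up to k elements (deterministic pick).
    new_anti |= set(sorted(domain - new_anti)[: k - len(new_anti)])
    used = set().union(*(l.antistings for l in labels))
    # Prefer a sting outside new_anti; otherwise any sting outside `used`.
    candidates = (domain - used - new_anti) or (domain - used)
    return Label(creator, min(candidates), frozenset(new_anti))


if __name__ == "__main__":
    a = Label(1, 1, frozenset({2, 3, 4}))
    b = Label(1, 5, frozenset({1, 6, 7}))
    print(next_label(1, [a, b], k=3))
```

With well-formed inputs (at most $k$ labels, each with $k$ antistings drawn from $D$), the set `domain - used` is never empty, which mirrors the counting argument above.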
4.1.2 The Labeling Algorithm

The labeling algorithm (Algorithm 2) specifies how the processors exchange their
label information in the asynchronous system and how they maintain proper
label bookkeeping so as to "discover" their greatest label and cancel all obsolete
ones. As we will be using pairs of labels with the same label creator, for
ease of presentation, we will be referring to these two variables as the (label)
pair. The first label in a pair is called $ml$. The second label is called $cl$ and it
is either $\bot$, or equal to a label that cancels $ml$ (i.e., $cl$ indicates whether $ml$ is
an obsolete label or not).

The processor state. Each processor stores an array of label pairs, $max_i[n]$,
where $max_i[i]$ refers to $p_i$'s maximal label pair and $max_i[j]$ holds the most
recent value that $p_i$ knows about $p_j$'s pair. Processor $p_i$ also stores the pairs of
the most-recently-used labels in the array of queues $storedLabels_i[n]$. The $j$-th
entry refers to the queue with pairs from $p_j$'s domain, i.e., that were created
by $p_j$. The algorithm makes sure that $storedLabels_i[j]$ includes only label pairs
with unique $ml$ from $p_j$'s domain and that at most one of them is legitimate,
i.e., not canceled. Queues $storedLabels_i[j]$ for $i \neq j$ have size $n + m$ whilst
$storedLabels_i[i]$ has size $2(mn + 2n^2 - 2n)$ where $m$ is the system's total link
capacity in labels. We later show (cf. Lemmas 4.3 and 4.4) that these queue
sizes are sufficient to prevent overflows of useful labels.
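
The queue behaviour assumed here (new records enter at the front, accessed records move to the front, and the oldest record falls off when the bound is exceeded) can be sketched as follows. The class name `BoundedMRUQueue` and its methods are illustrative and not taken from the paper; the capacity would be set to the bounds stated above.

```python
from collections import deque
from typing import Deque, Generic, Iterator, TypeVar

T = TypeVar("T")


class BoundedMRUQueue(Generic[T]):
    """Most-recently-used queue with a fixed capacity."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen discards items from the opposite end when full.
        self._items: Deque[T] = deque(maxlen=capacity)

    def add(self, item: T) -> None:
        """Insert at the front; the least recently used item may overflow."""
        self._items.appendleft(item)

    def touch(self, item: T) -> None:
        """Bring an accessed item to the front (raises ValueError if absent)."""
        self._items.remove(item)
        self._items.appendleft(item)

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)


if __name__ == "__main__":
    q: BoundedMRUQueue[str] = BoundedMRUQueue(capacity=3)
    for name in "abc":
        q.add(name)
    q.touch("a")     # "a" becomes the most recently used record
    q.add("d")       # "b" is now the oldest record and overflows
    print(list(q))
```

The point of the size bounds is that a label still needed for canceling is never the record that overflows.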

Information exchange between processors. Processor $p_i$ takes a step
whenever it receives two pairs $\langle sentMax, lastSent \rangle$ from some other processor.
We note that in a legal execution $p_j$'s pair includes both $sentMax$, which
refers to $p_j$'s maximal label pair $max_j[j]$, and $lastSent$, which refers to a recent
label pair that $p_j$ received from $p_i$ about $p_i$'s maximal label, $max_j[i]$ (line 16).

Whenever a processor $p_j$ sends a pair $\langle sentMax, lastSent \rangle$ to $p_i$, processor
$p_i$ stores the value of the arriving $sentMax$ field in $max_i[j]$ (line 19). However,
$p_j$ may have local knowledge of a label from $p_i$'s domain that cancels $p_i$'s maximal
label, $ml$, of the last received $sentMax$ from $p_i$ to $p_j$ that was stored in
$max_j[i]$. Then $p_j$ needs to communicate this canceling label in its next communication
to $p_i$. To this end, $p_j$ assigns this canceling label to $max_j[i].cl$, which
stops being $\bot$. Then $p_j$ transmits $max_j[i]$ to $p_i$ as a $lastSent$ label pair, and
this satisfies $lastSent.cl \not\preceq_{lb} lastSent.ml$, i.e., $lastSent.cl$ is either greater than or
incomparable to $lastSent.ml$. This makes $lastSent$ illegitimate and, in case this
still refers to $p_i$'s current maximal label, $p_i$ must cancel $max_i[i]$ by assigning it
with $lastSent$ (and thus $max_i[i].cl = lastSent.cl$) as done in line 20. Processor
$p_i$ then processes the two pairs received (lines 21 to 28).

Label processing. Processor $p_i$ takes a step whenever it receives a new pair
message $\langle sentMax, lastSent \rangle$ from processor $p_j$ (line 17). Each such step
starts by removing stale information, i.e., misplaced or doubly represented labels
(line 9). In the case that stale information exists, the algorithm empties
the entire label storage. Processor $p_i$ then tests whether the arriving two pairs
are already included in the label storage ($storedLabels[]$), otherwise it includes
them (line 22). The algorithm continues to see whether, based on the new pairs
added to the label storage, it is possible to cancel a non-canceled label pair
(which may well be the newly added pair). In this case, the algorithm updates
the canceling field of any label pair $lp$ (line 23) with the canceling label of a label
pair $lp'$ such that $lp'.ml \not\preceq_{lb} lp.ml$ (line 23). It is implied that since the two pairs
belong to the same storage queue, they have the same processor as creator. The
algorithm then checks whether any pair of the $max_i[]$ array can cause canceling
to a record in the label storage (line 24), and line 25 then removes any canceled
records that share the same creator identifier. The test also considers the case
in which the above update may cancel any arriving label in $max[j]$ and updates
this entry accordingly based on stored pairs (line 26).

After this series of tests and updates, the algorithm is ready to decide upon a
maximal label based on its local information. This is the $\preceq_{lb}$-greatest legit label
pair among all the ones in $max_i[]$ (line 27). When no such legit label exists, $p_i$
requests a legit label in its own label storage, $storedLabels_i[i]$, and if one does
not exist, creates a new one (line 28). This is done by passing the
labels in the $storedLabels_i[i]$ queue to the nextLabel() function. Note that the
returned label is coupled with a $\bot$ and the resulting label pair is added to both
$max_i[i]$ and $storedLabels_i[i]$.

Algorithm 2: Self-Stabilizing Labeling Algorithm; code for $p_i$

Variables:

$max[n]$ of $\langle ml, cl \rangle$: $max[i]$ is $p_i$'s largest label pair, $max[j]$ refers to $p_j$'s label pair
(canceled when $max[j].cl \neq \bot$).

$storedLabels[n]$: an array of queues of the most-recently-used label pairs, where
$storedLabels[j]$ holds the labels created by $p_j \in P$. For $p_j \in (P \setminus \{p_i\})$, $storedLabels[j]$'s
queue size is limited to $(n + m)$ w.r.t. label pairs, where $n = |P|$ is the number of processors
in the system and $m$ is the maximum number of label pairs that can be in transit in the
system. The $storedLabels[i]$'s queue size is limited to $(n(n^2 + m))$ pairs. The operator
$add(lp)$ adds $lp$ to the front of the queue, and $emptyAllQueues()$ clears all $storedLabels[]$
queues. We use $lp.remove()$ for removing the record $lp \in storedLabels[]$. Note that an
element is brought to the queue front every time this element is accessed in the queue.

Notation: Let $y$ and $y'$ be two records that include the field $x$. We denote
$y =_x y' \equiv (y.x = y'.x)$.

Macros:

 6 $legit(lp) = (lp = \langle \bullet, \bot \rangle)$
 7 $labels(lp)$ : return $(storedLabels[lp.ml.lCreator])$
 8 $double(j, lp) = (\exists lp' \in storedLabels[j] : ((lp \neq lp') \land ((lp =_{ml} lp') \lor (legit(lp) \land legit(lp')))))$
 9 $staleInfo() = (\exists p_j \in P, lp \in storedLabels[j] : (lp \neq_{lCreator} j) \lor double(j, lp))$
10 $recordDoesntExist(j) = (\langle max[j].ml, \bullet \rangle \notin labels(max[j]))$
11 $notgeq(j, lp)$ = if $(\exists lp' \in storedLabels[j] : (lp'.ml \not\preceq_{lb} lp.ml))$ then return$(lp'.ml)$
   else return$(\bot)$
12 $canceled(lp)$ = if $(\exists lp' \in labels(lp) : ((lp' =_{ml} lp) \land \neg legit(lp')))$ then return$(lp')$
   else return$(\bot)$
13 $needsUpdate(j) = (\neg legit(max[j]) \land \langle max[j].ml, \bot \rangle \in labels(max[j]))$
14 $legitLabels() = \{max[j].ml : \exists p_j \in P \land legit(max[j])\}$
15 $useOwnLabel()$ = if $(\exists lp \in storedLabels[i] : legit(lp))$ then $max[i] \leftarrow lp$
   else $storedLabels[i].add(max[i] \leftarrow \langle nextLabel(), \bot \rangle)$ // For every
   $lp \in storedLabels[i]$, we pass in nextLabel() both $lp.ml$ and $lp.cl$.

16 upon $transmitReady(p_j \in P \setminus \{p_i\})$ do transmit$(\langle max[i], max[j] \rangle)$
17 upon receive$(\langle sentMax, lastSent \rangle)$ from $p_j$
18 begin
19   $max[j] \leftarrow sentMax$;
20   if $\neg legit(lastSent) \land max[i] =_{ml} lastSent$ then $max[i] \leftarrow lastSent$
21   if $staleInfo()$ then $storedLabels.emptyAllQueues()$
22   foreach $p_j \in P : recordDoesntExist(j)$ do $labels(max[j]).add(max[j])$
23   foreach $p_j \in P, lp \in storedLabels[j] : (legit(lp) \land (notgeq(j, lp) \neq \bot))$ do
       $lp.cl \leftarrow notgeq(j, lp)$
24   foreach $p_j \in P, lp \in labels(max[j]) : (\neg legit(max[j]) \land (max[j] =_{ml} lp) \land legit(lp))$ do
       $lp \leftarrow max[j]$
25   foreach $p_j \in P, lp \in storedLabels[j] : double(j, lp)$ do $lp.remove()$
26   foreach $p_j \in P : (legit(max[j]) \land (canceled(max[j]) \neq \bot))$ do
       $max[j] \leftarrow canceled(max[j])$
27   if $legitLabels() \neq \emptyset$ then $max[i] \leftarrow \langle \max_{\prec_{lb}}(legitLabels()), \bot \rangle$
28   else $useOwnLabel()$
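
The stale-information test of line 9 can be sketched in Python as follows. This is an illustration under assumptions, not the paper's code: labels are modelled as plain tuples `(lCreator, sting, Antistings)`, `None` plays the role of $\bot$, and the label storage is a dictionary from creator identifier to a list of label pairs.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional, Tuple

Label = Tuple[int, int, FrozenSet[int]]   # (lCreator, sting, Antistings)


class LabelPair(NamedTuple):
    ml: Label                  # the label itself
    cl: Optional[Label]        # canceling label; None stands for "bottom"


def legit(lp: LabelPair) -> bool:
    """A pair is legitimate when it carries no canceling label."""
    return lp.cl is None


def double(queue: List[LabelPair], lp: LabelPair) -> bool:
    """Another record has the same ml, or both records are legitimate."""
    return any(
        other != lp and (other.ml == lp.ml or (legit(lp) and legit(other)))
        for other in queue
    )


def stale_info(stored: Dict[int, List[LabelPair]]) -> bool:
    """True if some record is misplaced or doubly represented."""
    return any(
        lp.ml[0] != j or double(queue, lp)
        for j, queue in stored.items()
        for lp in queue
    )


if __name__ == "__main__":
    a: Label = (1, 1, frozenset({2, 3, 4}))
    b: Label = (1, 5, frozenset({1, 6, 7}))
    print(stale_info({1: [LabelPair(b, None), LabelPair(a, b)], 2: []}))
```

When `stale_info` returns true, the algorithm responds by emptying every queue (line 21), so the check only needs to detect a violation, not to locate it.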

4.2 Correctness proof 

We are now ready to show the correctness of the algorithm. We begin with a 
proof overview. 

Overview of the proof. The proof considers an execution $R$ of Algorithm 2
that may initiate in an arbitrary configuration (and include a processor that
takes a practically infinite number of steps). It starts by showing some basic
facts, such as: (1) stale information is removed, i.e., $storedLabels_i[j]$ includes
only unique copies of $p_j$'s labels, and at most one legitimate such label (Corollary 4.1),
and (2) $p_i$ either adopts or creates the $\preceq_{lb}$-greatest legitimate local
label (Lemma 4.2). The proof then presents bounds on the number of adoption
steps (Lemmas 4.3 and 4.4), that define the required queue sizes to avoid label
overflows.

The proof continues to show that active processors can eventually stop adopting
or creating labels, by tackling individual cases where canceled or incomparable
label pairs may cause a change of the local maximal label. We show
that such labels eventually disappear from the system (Lemma 4.5) and thus
no new labels are being adopted or created (Lemma 4.6), which then implies
the existence of a global maximal label (Lemma 4.7). Namely, there is a legitimate
label $\ell_{max}$, such that for any processor $p_i \in P$ (that takes a practically
infinite number of steps in $R$), it holds that $max_i[i] = \ell_{max}$. Moreover, for
any processor $p_j \in P$ that is active throughout the execution, it holds that
$p_i$'s local maximal label $max_i[i] = \ell_{max}$ is the $\preceq_{lb}$-greatest of all the labels
in $max_i[]$ and there is no label pair in $storedLabels_i[j]$ that cancels $\ell_{max}$, i.e.,
$((max_i[j] \preceq_{lb} \ell_{max}) \land ((\forall \ell \in storedLabels_i[j] : legit(\ell)) \Rightarrow (\ell \preceq_{lb} \ell_{max})))$. We
then demonstrate that, when starting from an initial arbitrary configuration, the
system eventually reaches a configuration in which there is a global maximal
label (Theorem 4.2).

Before we present the proof in detail, we provide some helpful definitions and 
notation. 

Definitions. We define $\mathcal{H}$ to be the set of all label pairs that can be in transit
in the system, with $|\mathcal{H}| = m$. So in an arbitrary configuration, there can be
up to $m$ corrupted label pairs in the system's links. We also denote $\mathcal{H}_{i,j}$ as
the set of label pairs that are in transit from processor $p_i$ to processor $p_j$. The
number of label pairs in $\mathcal{H}_{i,j}$ obeys the link capacity bound. Recall that the
data structures used (e.g., $max_i[]$, $storedLabels_i[]$, etc.) store label pairs. For
convenience of presentation and when clear from the context, we may refer to
the $ml$ part of the label pair as "the label".

4.2.1 No stale information 

Lemma 4.1 says that the predicate $staleInfo()$ (line 9) can only hold during
the first execution of the receive() event (line 17).

Lemma 4.1 Let $p_i \in P$ be a processor for which $\neg staleInfo_i()$ (line 9) does
not hold during the $k$-th step in $R$ that includes the complete execution of the
receive() event (from line 17 to 28). Then $k = 1$.

Proof. Since $R$ starts in an arbitrary configuration, there could be a queue in
$storedLabels_i[]$ that holds two label records from the same creator, a label that
is not stored according to its creator identifier, or more than one legitimate label.
Therefore, $staleInfo_i()$ might hold during the first execution of the receive()
event. When this is the case, the $storedLabels_i[]$ structure is emptied (line 21).
During that receive() event execution (and any event execution after this), $p_i$
adds records to a queue in $storedLabels_i[]$ (according to the creator identifier)
only after checking whether $recordDoesntExist()$ holds (line 22).

Any other access to $storedLabels_i[]$ merely updates cancelations or removes
duplicates. Namely, canceling labels that are not the $\preceq_{lb}$-greatest among the
ones that share the same creating processor (line 23) and canceling records
that were canceled by other processors (line 24), as well as removing legitimate
records that share the same $ml$ (line 25). It is, therefore, clear that in any
subsequent iteration of receive() (after the first), $staleInfo()$ cannot hold. ■

Lemma 4.1, along with lines 9 and 26 of the algorithm, implies Corollary 4.1.

Corollary 4.1 Consider a suffix $R'$ of execution $R$ that starts after the execution
of a receive() event. Then the following hold throughout $R'$: (i) $\forall p_i, p_j \in P$,
the state of $p_i$ encodes at most one legitimate label, $\ell_j =_{lCreator} j$, and (ii) $\ell_j$ can
only appear in $storedLabels_i[j]$ and $max_i[]$ but not in $storedLabels_i[k] : k \neq j$.

4.2.2 Local $\preceq_{lb}$-greatest legitimate local label

Lemma 4.2 considers processors for which $staleInfo()$ (line 9) does not hold.
Note that $\neg staleInfo()$ holds at any time after the first step that includes the
receive() event (Lemma 4.1). Lemma 4.2 shows that $p_i$ either adopts or creates
the $\preceq_{lb}$-greatest legitimate local label and stores it in $max_i[i]$.

Lemma 4.2 Let $p_i \in P$ be a processor such that $\neg staleInfo_i()$ (line 9),
and $L_{pre}(i) = \{max_i[j].ml : \exists p_j \in P \land legit(max_i[j]) \land (\exists \langle max_i[j].ml, x \rangle \in
(labels(max_i[j]) \setminus \{max_i[j]\}) \Rightarrow (x = \bot))\}$ be the set of $max_i[]$'s labels
that, before $p_i$ executes lines 21 to 28, are legitimate both in $max_i[]$ and in
$storedLabels_i[]$'s queues. Let $L_{post}(i) = \{max_i[j].ml : \exists p_j \in P \land legit(max_i[j])\}$
and $\langle \ell, \bot \rangle$ be the value of $max_i[i]$ immediately after $p_i$ executes lines 21 to 28.
The label $\langle \ell, \bot \rangle$ is the $\preceq_{lb}$-greatest legitimate label in $L_{post}(i)$. Moreover, suppose
that $L_{pre}(i)$ has a $\preceq_{lb}$-greatest legitimate label, then that label is $\langle \ell, \bot \rangle$.

Proof. $\langle \ell, \bot \rangle$ is the $\preceq_{lb}$-greatest legitimate label in $L_{post}(i)$. Suppose
that immediately before line 27, we have that $legitLabels_i() \neq \emptyset$, where
$legitLabels_i() = \{max_i[j].ml : \exists p_j \in P \land legit(max_i[j])\}$ (line 14). Note that
in this case $L_{post}(i) = legitLabels_i()$. By the definition of $\preceq_{lb}$-greatest legitimate
label and line 27, $max_i[i] = \langle \ell, \bot \rangle$ is the $\preceq_{lb}$-greatest legitimate label in
$L_{post}(i)$. Suppose that $legitLabels_i() = \emptyset$ immediately before line 27, i.e., there
are no legitimate labels in $\{max_i[j] : \exists p_j \in P\}$. By the definition of $\preceq_{lb}$-greatest
legitimate label and line 15, $max_i[i] = \langle \ell, \bot \rangle$ is the $\preceq_{lb}$-greatest legitimate label
in $L_{post}(i)$.

Suppose that $rec = \langle \ell', \bot \rangle$ is a $\preceq_{lb}$-greatest legitimate label in $L_{pre}(i)$,
then $\ell = \ell'$. We show that the record $rec$ is not modified in $max_i[]$ until the
end of the execution of lines 21 to 28. Moreover, the records that are modified
in $max_i[]$ are not included in $L_{pre}(i)$ (they are canceled in $storedLabels_i[]$) and
no records in $max_i[]$ become legitimate. Therefore, $rec$ is also the $\preceq_{lb}$-greatest
legitimate label in $L_{post}(i)$, and thus, $\ell = \ell'$.

Since we assume that $staleInfo_i()$ does not hold, line 21 does not modify
$rec$. Lines 22, 23 and 25 might add, modify, and respectively, remove
$storedLabels_i[]$'s records, but they do not modify $max_i[]$. Since $rec$ is not canceled
in $storedLabels_i[]$ and is the $\preceq_{lb}$-greatest legitimate label in $max_i[]$, the predicate
$(legit(max[j]) \land notgeq(j))$ does not hold and line 23 does not modify $rec$. Moreover,
the records in $max_i[]$, for which that predicate holds, become illegitimate.


4.2.3 Bounding the number of labels 

Lemmas 4.3 and 4.4 present bounds on the number of adoption steps. These
are $n + m$ for labels by processors that become inactive at any point in $R$ and
$(mn + 2n^2 - 2n)$ for any active processor. Following the above, choosing the
queue sizes as $n + m$ for $storedLabels_i[j]$ if $i \neq j$, and $2(nm + 2n^2 - 2n) + 1$ for
$storedLabels_i[i]$ is sufficient to prevent overflows given that $m$ is the system's
total link capacity in labels.

Maximum number of label adoptions in the absence of creations.

Suppose that there exists a processor, $p_j$, that has stopped adding labels to
the system (the else part of line 28), say, because it became inactive (crashed),
or it names a maximal label that is the $\preceq_{lb}$-greatest label among all the ones
that the network ever delivers to $p_j$. Lemma 4.3 bounds the number of labels
from $p_j$'s domain that any processor $p_i \in P$ adopts in $R$.

Lemma 4.3 Let $p_i, p_j \in P$ be two processors. Suppose that $p_j$ has stopped
adding labels to the system configuration (the else part of line 28), and sending
(line 16) these labels during $R$. Processor $p_i$ adopts (line 27) at most $(n + m)$
labels, $\ell_j : (\ell_j =_{lCreator} j)$, from $p_j$'s unknown domain $(\ell_j \notin labels_i(\ell_j))$ where
$m$ is the maximum number of label pairs that can be in transit in the system.

Proof. Let $p_k \in P$. At any time (after the first step in $R$) processor $p_k$'s
state encodes at most one legitimate label, $\ell_j$, for which $\ell_j =_{lCreator} j$ (Corollary 4.1).
Whenever $p_i$ adopts a new label $\ell_j$ from $p_j$'s domain (line 27) such
that $\ell_j : (\ell_j =_{lCreator} j)$, this implies that $\ell_j$ is the only legitimate label pair in
$storedLabels_i[j]$. Since $\ell_j$ was not transmitted by $p_j$ before it was adopted, $\ell_j$
must come from $p_k$'s state delivered by a transmit event (line 16) or delivered
via the network as part of the set of labels that existed in the initial arbitrary
state. The bound holds since there are $n$ processors, such as $p_k$, and $m$ bounds
the number of labels in transit. Moreover, no other processor can create label
pairs from the domain of $p_j$. ■

Maximum number of label creations. Lemma 4.4 shows a bound on the 
number of adoption steps that does not depend on whether the labels are from 
the domain of an active or (eventually) inactive processor. 

Lemma 4.4 Let $p_i \in P$ and $L_i = \ell_{i_0}, \ell_{i_1}, \ldots$ be the sequence of legitimate
labels, $\ell_{i_k} =_{lCreator} i$, from $p_i$'s domain, which $p_i$ stores in $max_i[i]$ through the
reception (line 17) or creation of labels (line 28), where $k \in \mathbb{N}$. It holds that
$|L_i| \le n(n^2 + m)$.

Proof. Let $L_{i,j} = \ell_{i_0,j}, \ell_{i_1,j}, \ldots$ be the sequence of legitimate labels that $p_i$
stores in $max_i[j]$ during $R$ and $C_{i,j} = \ell^c_{i_0,j}, \ldots$ be the sequence of legitimate
labels that $p_i$ receives from processor $p_j$'s domain. We consider the following
cases in which $p_i$ stores $L_i$'s values in $max_i[i]$.

(1) When $\ell_{i_k} = \ell_{j_0,j'}$, where $p_j, p_{j'} \in P$ and $k \in \mathbb{N}$. This case considers
the situation in which $max_i[i]$ stores a label that appeared in $max_j[j']$ at the
(arbitrary) starting configuration (i.e., $\ell_{j_0,j'} \in L_{j,j'}$). There are at most $n(n-1)$
such legitimate label values from $p_i$'s domain, namely $n - 1$ arrays $max_j[]$ of
size $n$.

(2) When i ik = i ik , j- = i? oj ,, where Pj , p y <E P, k, k' G N and f jk ,,y j- f jk ,,j. 
This case considers the situation in which $max_i[i]$ stores a label that appeared
in the communication channel between $p_j$ and $p_{j'}$ at the (arbitrary) starting
configuration (i.e., $\ell^c_{j_0,j'} \in C_{j,j'}$) and appeared in $max_j[j']$ before $p_j$ communicated
this to $p_i$. There are at most $m$ such values, i.e., as many as the capacity
of the communication links in labels, namely $|\mathcal{H}|$.

(3) When $\ell_{i_k}$ is the return value of nextLabel() (the else part of
line 28). Processor $p_i$ aims at adopting the $\preceq_{lb}$-greatest legitimate label
that is stored in $max_i[]$, whenever such exists (line 27). Otherwise, $p_i$ uses a label
from its domain; either one that is the $\preceq_{lb}$-greatest legit label among the ones
in $storedLabels_i[i]$, whenever such exists, or the returned value of nextLabel()
(line 28).

The latter case (the else part of line 28) refers to labels, $\ell_{i_k}$, that $p_i$ stores in
$max_i[i]$ only after checking that there are no legitimate labels stored in $max_i[]$
or $storedLabels_i[i]$. Note that every time $p_i$ executes the else part of line 28,
$p_i$ stores the returned label, $\ell_{i_k}$, in $storedLabels_i[i]$. After that, there are only
three events for $\ell_{i_k}$ not to be stored as a legitimate label in $storedLabels_i[i]$:
(i) execution of line 21, (ii) the network delivers to $p_i$ a label, $\ell'$, that either
cancels $\ell_{i_k}$ or for which $\ell' \not\preceq_{lb} \ell_{i_k}$, and (iii) $\ell_{i_k}$ overflows from $storedLabels_i[i]$
after exceeding the $(n(n^2 + m) + 1)$ limit which is the size of the queue.

Note that Lemma 4.1 says that event (i) can occur only once (during $p_i$'s
first step). Moreover, only $p_i$ can generate labels that are associated with its
domain (in the else part of line 28). Each such label is $\preceq_{lb}$-greater than or equal to all
the ones in $storedLabels_i[i]$ (by the definition of nextLabel() in Algorithm 1).

Event (ii) cannot occur after $p_i$ has learned all the labels $\ell \in remoteLabels_i$
for which $\ell \notin storedLabels_i[i]$, where $remoteLabels_i = (((\bigcup_{p_j \in P} localLabels_{i,j})
\cup \mathcal{H}) \setminus storedLabels_i[i])$ and $localLabels_{i,j} = \{\ell' : \ell' =_{lCreator} i, \exists p_j \in P : ((\ell' \in
storedLabels_j[i]) \lor (\exists p_k \in P : \ell' = max_j[k].ml))\}$. During this learning process,
$p_i$ cancels or updates the cancellation labels in $storedLabels_i[i]$ before adding a
new legitimate label. Thus, this learning process can be seen as moving labels
from $remoteLabels_i$ to $storedLabels_i[i]$ and then keeping at most one legitimate
label available in $storedLabels_i[i]$. Every time $storedLabels_i[i]$ accumulates a
label $\ell$ that was unknown to $p_i$, the use of nextLabel() allows it to create a label
$\ell_{i_k}$ that is $\preceq_{lb}$-greater than any label in $storedLabels_i[i]$ and eventually than all
the ones in $remoteLabels_i$.

Note that $remoteLabels_i$'s labels must come from the (arbitrary) start of
the system, because $p_i$ is the only one that can add a label to the system from
its domain and therefore this set cannot increase in size. These labels include
those that are in transit in the system and all those that are unknown to $p_i$
but exist in the $max_j[\bullet]$ or $storedLabels_j[i]$ structures of some other processor
$p_j$. By Lemma 4.3 we know that $|storedLabels_j[i]| \le n + m$ for $i \neq j$. From
the three cases of $L_i$ labels that we detailed at the beginning of this proof ((1)-(3)),
we can bound the size of $remoteLabels_i$ as follows: for $p_j \in P : j \neq i$
we have that $|remoteLabels_i| = (n - 1)(|max[]| + |storedLabels_j[i]|) + |\mathcal{H}| =
(n - 1)(n + (n + m)) + m = mn + 2n^2 - 2n$. Since $p_i$ may respond to each
of these labels with a call to nextLabel(), we require that $storedLabels_i[i]$ has
size $2|remoteLabels_i| + 1$ label pairs in order to be able to accommodate all the
labels from $remoteLabels_i$ and the ones created in response to these, plus the
current greatest. Thus, what is suggested by event (ii) of $p_i$, i.e., receiving labels
from $remoteLabels_i$, stops happening before overflows (event (iii)) occur, since
$storedLabels_i[i]$ has been chosen to have a size that can accommodate all the
labels from $remoteLabels_i$ and those created by $p_i$ as a response to these. This
size is $2(mn + 2n^2 - 2n) + 1$ which is $O(n^3)$. ■
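
The algebra behind the bound on $|remoteLabels_i|$ can be spelled out step by step. The numeric instance ($n = 3$, $m = 4$) is an arbitrary choice used only as a sanity check.

```latex
\begin{align*}
|remoteLabels_i| &= (n-1)\bigl(n + (n+m)\bigr) + m \\
                 &= (n-1)(2n+m) + m \\
                 &= 2n^2 + mn - 2n - m + m \\
                 &= mn + 2n^2 - 2n. \\[4pt]
\text{For } n = 3,\ m = 4:\quad
  & (3-1)(3 + 7) + 4 = 24 = 4\cdot 3 + 2\cdot 9 - 2\cdot 3, \\
  & 2\,|remoteLabels_i| + 1 = 49 \text{ label pairs in } storedLabels_i[i].
\end{align*}
```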

4.2.4 Pair diffusion 

The proof continues and shows that active processors can eventually stop adopting
or creating labels. We are particularly interested in looking into cases in
which there are canceled label pairs and incomparable ones. We show that they
eventually disappear from the system (Lemma 4.5) and thus no new labels are
being adopted or created (Lemma 4.6), which then implies the existence of a
global maximal label (Lemma 4.7).

Notation         | Definition                                                                                                          | Remark
-----------------+---------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------
$hName_{i,j,k}$  | $\{(\ell_j, \ell_k) : \ell_j = max_j[j] \land (\exists \langle \ell_k, \bullet \rangle \in \mathcal{H}_{k,j})\}$  | In transit from $p_k$ to $p_j$ as $sentMax$ feedback about $max_k[k]$
$hAck_{i,j,k}$   | $\{(\ell_j, \ell_k) : \ell_j = max_j[k] \land (\exists \langle \bullet, \ell_k \rangle \in \mathcal{H}_{k,j})\}$  | In transit from $p_k$ to $p_j$ as $lastSent$ feedback about $max_k[j]$
$max_{i,j,k}$    | $\{(max_j[j], max_k[k])\}$                                                                                          | Local maximal labels of $p_j$ and $p_k$
$ack_{i,j,k}$    | $\{(max_j[j], max_k[j])\}$                                                                                          | $\ell_j$ is $p_j$'s local maximal label and $\ell_k = max_k[j]$
$stored_{i,j,k}$ | $\{\{max_j[j]\} \times storedLabels_k[i]\}$                                                                         | A label $\ell_k$ in $storedLabels_k[i]$ that can cancel $\ell_j = max_j[j]$

Table 1: The notation used to identify the possible positions of label pairs $\ell_j$
and $\ell_k$ that can cause canceling as used in Lemmas 4.5 to 4.7 and in
Theorem 4.2.

Lemmas 4.5 and 4.6, as well as Lemma 4.7 and Theorem 4.2, assume the
existence of at least one processor, $p_{unknown} \in P$ whose identity is unknown,
that takes a practically infinite number of steps in $R$. Suppose that processor
$p_i \in P$ takes a bounded number of steps in $R$ during a period in which $p_{unknown}$
takes a practically infinite number of steps. We say that $p_i$ has become inactive
(crashed) during that period and assume that it does not resume taking steps
at any later stage of $R$ (in the manner of fail-stop failures, as in Section 3).

Consider a processor $p_i \in P$ that takes any number of (bounded or practically
infinite) steps in $R$ and two processors $p_j, p_k \in P$ that take a practically
infinite number of steps in $R$. Suppose that $p_j$ has a label pair $\ell$ as its local
maximal, and there exists another label pair $\ell'$ such that $\ell' \not\preceq_{lb} \ell$ and they have
the same creator $p_i$. Algorithm 2 suggests only two possible routes for some
label pair $\ell'$ to find its way in the system through $p_j$: either by $p_j$ adopting
$\ell'$ (line 27), or by creating it as a new label (the else part of line 28). Note,
however, that $p_j$ is not allowed to create a label in the name of $p_i$ and since
$\ell' =_{lCreator} i$, the only way for $\ell'$ to disturb the system is if this is adopted by
$p_j$ as in line 27. We use the following definitions for estimating whether there
are such label pairs as $\ell$ and $\ell'$ in the system.

There is a risk for two label pairs from $p_i$'s domain, $\ell_j$ and $\ell_k$, to cause such
a disturbance when either they cancel one another or when it can be found that
one is not greater than the other. Thus, we use the predicate $risk_{i,j,k}(\ell_j, \ell_k) =
(\ell_j =_i \ell_k) \land legit(\ell_j) \land (notGreater(\ell_j, \ell_k) \lor canceled(\ell_j, \ell_k))$ to estimate whether
$p_j$'s state encodes a label pair, $\ell_j =_{lCreator} i$, from $p_i$'s domain that may disturb
the system due to another label, $\ell_k$, from $p_i$'s domain that $p_k$'s state encodes,
where $canceled(\ell_j, \ell_k) = (legit(\ell_j) \land \neg legit(\ell_k) \land \ell_j =_{ml} \ell_k)$ refers to a case in
which label $\ell_j$ is canceled by label $\ell_k$, $notGreater(\ell_j, \ell_k) = (legit(\ell_j) \land legit(\ell_k) \land
\ell_k \not\preceq_{lb} \ell_j)$ refers to a case in which label $\ell_k$ is not $\preceq_{lb}$-greater than $\ell_j$, and
$(\ell_j =_i \ell_k) \equiv (\ell_j =_{lCreator} \ell_k =_{lCreator} i)$.

These two label pairs, $\ell_j$ and $\ell_k$, can be the ones that processors $p_j$ and $p_k$
name as their local maximal label, as in $max_{i,j,k} = \{(max_j[j], max_k[k])\}$, or
recently received from one another, as in $ack_{i,j,k} = \{(max_j[j], max_k[j])\}$. These
two cases also appear when considering the communication channel (or buffers)
from $p_k$ to $p_j$, as in $hName_{i,j,k} = \{(\ell_j, \ell_k) : \ell_j = max_j[j] \land (\exists \langle \ell_k, \bullet \rangle \in \mathcal{H}_{k,j})\}$
and $hAck_{i,j,k} = \{(\ell_j, \ell_k) : \ell_j = max_j[k] \land (\exists \langle \bullet, \ell_k \rangle \in \mathcal{H}_{k,j})\}$. We also note the
case in which $p_k$ stores a label pair that might disturb the one that $p_j$ names as
its (local) maximal, as in $stored_{i,j,k} = \{\{max_j[j]\} \times storedLabels_k[i]\}$. We define
the union of these cases to be the set $risk = \{(\ell_j, \ell_k) \in max_{i,j,k} \cup ack_{i,j,k} \cup
hName_{i,j,k} \cup hAck_{i,j,k} \cup stored_{i,j,k} : \exists p_i, p_j, p_k \in P \land \neg stopped_j \land \neg stopped_k \land
risk_{i,j,k}(\ell_j, \ell_k)\}$, where $stopped_i = true$ when processor $p_i$ is inactive (crashed)
and $false$ otherwise. The above notation can also be found in Table 1.
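
The per-pair predicate $risk_{i,j,k}$ can be sketched in Python as below. The sketch rests on assumptions that should be read as such: the class and function names are illustrative, `None` stands for $\bot$, and the relation inside $notGreater$ is read as "$\ell_k$ is not $\preceq_{lb}$-smaller-or-equal to $\ell_j$", which is one interpretation of the definition above.

```python
from typing import FrozenSet, NamedTuple, Optional


class Label(NamedTuple):
    creator: int
    sting: int
    antistings: FrozenSet[int]


class LabelPair(NamedTuple):
    ml: Label
    cl: Optional[Label] = None     # None stands for "bottom" (not canceled)


def precedes(a: Label, b: Label) -> bool:
    if a.creator != b.creator:
        return a.creator < b.creator
    return a.sting in b.antistings and b.sting not in a.antistings


def preceq(a: Label, b: Label) -> bool:
    return a == b or precedes(a, b)


def legit(lp: LabelPair) -> bool:
    return lp.cl is None


def canceled(lj: LabelPair, lk: LabelPair) -> bool:
    """lj is still legit locally, but lk records the same ml as canceled."""
    return legit(lj) and not legit(lk) and lj.ml == lk.ml


def not_greater(lj: LabelPair, lk: LabelPair) -> bool:
    """Both legit, and lk is not smaller-or-equal to lj (assumed reading)."""
    return legit(lj) and legit(lk) and not preceq(lk.ml, lj.ml)


def risk(i: int, lj: LabelPair, lk: LabelPair) -> bool:
    """Both pairs are from p_i's domain and lk may force lj to change."""
    same_domain = lj.ml.creator == i == lk.ml.creator
    return same_domain and legit(lj) and (not_greater(lj, lk) or canceled(lj, lk))


if __name__ == "__main__":
    x = Label(1, 1, frozenset({2, 3, 4}))
    y = Label(1, 5, frozenset({1, 6, 7}))    # x precedes y
    print(risk(1, LabelPair(x), LabelPair(y)), risk(1, LabelPair(y), LabelPair(x)))
```

In this reading, a processor holding the greater label `y` is not at risk from `x`, while a processor still holding `x` is at risk until `y` reaches it.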

Lemma 4.5 Suppose that there exists at least one processor, $p_{unknown} \in P$
whose identity is unknown, that takes a practically infinite number of steps in $R$
during a period where $p_j$ never adopts labels (line 27), $\ell_j : (\ell_j =_{lCreator} i)$, from
$p_i$'s unknown domain $(\ell_j \notin labels_j(\ell_j))$. Then eventually $risk = \emptyset$.

Proof. Suppose this lemma is false, i.e., the assumptions of the lemma hold and yet in any configuration $c \in R$ it holds that $(\ell_j, \ell_k) \in risk \neq \emptyset$. We use the definition of $risk$ to study the different cases. By the definition of $risk$, we can assume, without loss of generality, that $p_j$ and $p_k$ are alive throughout $R$.

Claim: If $p_j$ and $p_k$ are alive throughout $R$, i.e., $stopped_j = stopped_k = False$, then $risk \neq \emptyset \iff risk_{i,j,k} = True$. This means that there exist two label pairs $(\ell_j, \ell_k)$ where $\ell_k$ can force a cancellation to occur. Then the only way for these two labels to force $risk \neq \emptyset$ is if, throughout the execution, $\ell_k$ never reaches $p_j$.

The above claim is verified by a simple observation of the algorithm. If $\ell_k$ reaches $p_j$, then lines 20, 24 and 26 guarantee a canceling, and lines 22 and 23 ensure that these labels are kept canceled inside $storedLabels_j[]$. The latter is also ensured by the bounds on the labels given in Lemmas 4.3 and 4.4, which do not allow queue overflows. Thus, the only way to keep these two labels in $risk$ is to keep $\ell_k$ hidden from $p_j$ throughout $R$. We perform a case-by-case analysis to show that it is impossible for label $\ell_k$ to be "hidden" from $p_j$ for an infinite number of steps in $R$.

The case of $(\ell_j, \ell_k) \in hName_{i,j,k}$. This is the case where $\ell_j = max_j[j]$ and $\ell_k$ is a label in $\mathcal{H}_{k,j}$ that appears to be $max_k[k]$. This may also include such labels from the corrupt state. We note that $p_j$ and $p_k$ are alive throughout $R$. The stabilizing implementation of the data-link ensures that a message cannot reside in the communication channel during an infinite number of transmit()-receive() events of the two ends. Thus $\ell_k$, which may well have only a single instance in the link coming from the initial corrupt state, will eventually either reach $p_j$ or become lost. In both cases (the first by the Claim, the second trivially) the two clashing labels are removed from $risk$ and the result follows.

The case of $(\ell_j, \ell_k) \in hAck_{i,j,k}$. This is the case where $\ell_j = max_j[j]$ and $\ell_k$ is a label in $\mathcal{H}_{k,j}$ that appears to be $max_k[j]$. The line of proof is exactly the same as in the previous case. This case also follows by the same arguments as the case of $(\ell_j, \ell_k) \in ack_{i,j,k}$.

The case of $(\ell_j, \ell_k) \in max_{i,j,k}$. Here the label pairs $\ell_j$ and $\ell_k$ are named by $p_j$ and $p_k$ as their local maximal label. We note that $p_j$ and $p_k$ are alive throughout $R$. By our self-stabilizing data-links and by the assumption on the communication that a message sent infinitely often is received infinitely often, $p_k$ transmits its $max_k[k]$ label infinitely often when executing line 16. This implies that $p_j$ receives $\ell_k$ infinitely often. By the Claim the canceling takes place, and the two labels are eventually removed from the global observer's $risk$ set, giving a contradiction.

The case of $(\ell_j, \ell_k) \in ack_{i,j,k}$. This is the case where the labels $(\ell_j, \ell_k)$ belong to $\{(max_j[j], max_k[j])\}$. Since processor $p_k$ continuously transmits its label pair in $max_k[j]$ (line 16), the proof is almost identical to the previous case.

The case of $(\ell_j, \ell_k) \in stored_{i,j,k}$. The proof of this case follows by arguments similar to the case of $(\ell_j, \ell_k) \in max_{i,j,k}$. Namely, $p_k$ eventually receives the label pair $\ell_j = max_j[j]$. The assumption that $risk_{i,j,k}(\ell_j, \ell_k)$ holds implies that one of the tests in lines 23 and 26 will update either $storedLabels_k[i]$ or, respectively, $max_k[j]$ with canceling values. We note that for the latter case we argue that $p_j$ eventually receives the canceled label pair in $max_k[j]$, because we assume that $p_j$ does not change the value of $max_j[j]$ throughout $R$.

By careful and exhaustive examination of all the cases, we have proved that there is no way to keep $\ell_k$ hidden from $p_j$ throughout $R$. This is a contradiction to our initial assumption, and thus eventually $risk = \emptyset$. ■


Lemma 4.6 Suppose that $risk = \emptyset$ in every configuration throughout $R$ and that there exists at least one processor, $p_{unknown} \in P$, whose identity is unknown, that takes a practically infinite number of steps in $R$. Then $p_j$ never adopts labels (line 27), $\ell_j : (\ell_j =_{lCreator} i)$, from $p_i$'s unknown domain ($\ell_j \notin labels_j(\ell_j)$).

Proof. Note that the definition of $risk$ considers almost every possible combination of two label pairs $\ell_j$ and $\ell_k$ from $p_i$'s domain that are stored by processor $p_j$ and, respectively, $p_k$ (or in the channels to them). The only combination that is not considered is $(\ell_j, \ell_k) \in storedLabels_j[i] \times storedLabels_k[i]$. However, this combination can indeed reside in the system during a legal execution, and it cannot lead to a disruption when $risk = \emptyset$ in every configuration throughout $R$: before that could happen, either $p_j$ or $p_k$ would have to adopt $\ell_j$ or, respectively, $\ell_k$, which contradicts the assumption that $risk = \emptyset$.

The only way that a label in $storedLabels[]$ can cause a change of the local maximum label, and be communicated so as to disrupt the system, is to find its way to $max[]$. Note that $p_j$ cannot create a label under $p_i$'s domain (line 28), since the algorithm does not allow this, nor can it adopt a label from $storedLabels_j[i]$ (by the definition of $legitLabels()$, line 14). So there is no way for $\ell_j$ to be added to $max_j[j]$ and thus make $risk \neq \emptyset$ through creation or adoption.

On the other hand, we note that there is only one case where $p_k$ extracts a label from $storedLabels_k[i] : i \neq k$ and adds it to $max_k[j]$. This is when it finds a legit label $\ell_j \in max_k[j]$ that can be canceled by some other label $\ell_k$ in $storedLabels_k[i]$ (line 26). But this is the case of having the label pair $(\ell_j, \ell_k)$ in $stored_{i,j,k}$. Our assumption that $risk = \emptyset$ implies that $stored_{i,j,k} = \emptyset$. This is a contradiction. Thus a label $\ell_k$ cannot reach $max_k[]$ in order for it to be communicated to $p_j$.

In the same way we can argue for the case of two messages in transit, $\mathcal{H}_{j,k} \times \mathcal{H}_{k,j}$, and that $risk = \emptyset$ throughout $R$. ■

Lemma 4.7 Suppose that $risk = \emptyset$ in every configuration throughout $R$ and that there exists at least one processor, $p_{unknown} \in P$, whose identity is unknown, that takes a practically infinite number of steps in $R$. There is a legitimate label $\ell_{max}$ such that for any processor $p_i \in P$ (that takes a practically infinite number of steps in $R$), it holds that $max_i[i] = \ell_{max}$. Moreover, for any processor $p_j \in P$ (that takes a practically infinite number of steps in $R$), it holds that $((max_i[j] \preceq_{lb} \ell_{max}) \wedge ((\forall \ell \in storedLabels_i[j] : legit(\ell)) \Rightarrow (\ell \preceq_{lb} \ell_{max})))$.

Proof. We initially note that any two processors $p_i, p_j$ that take an infinite number of steps in $R$ will exchange their local maximal labels $max_i[i]$ and $max_j[j]$ an infinite number of times. By the assumption that $risk = \emptyset$, there are no two label pairs in the system that can cancel each other, that are unknown to $p_i$ or $p_j$, and that are still part of $max_i[i]$ or $max_j[j]$. Hence, any differences in the local maximal label of the processors must be due to the labels' $lCreator$ difference.

Since $max_i[i]$ and $max_j[j]$ are continuously exchanged and received, assuming $max_i[i] \prec_{lb} max_j[j]$ where the labels are of different label creators, $p_i$ will be led to a receive() event of $\langle sentMax_j, lastSent_j \rangle$ where $max_i[i] \prec_{lb} sentMax_j$. By line 19, $sentMax_j$ is added to $max_i[j]$, and since $risk = \emptyset$ no action from line 20 to line 26 takes place. Line 27 will then indicate that the greatest label in $max_i[\bullet]$ is that in $max_i[j]$, which is then adopted by $p_i$ as $max_i[i]$, i.e., $p_i$'s local maximal. The above is true for every pair of processors taking an infinite number of steps in $R$, and so we conclude that eventually all such processors converge to the same $\ell_{max}$ label, i.e., it holds that $((max_i[j] \preceq_{lb} \ell_{max}) \wedge ((\forall \ell \in storedLabels_i[j] : legit(\ell)) \Rightarrow (\ell \preceq_{lb} \ell_{max})))$. ■
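The convergence argument above can be illustrated with a toy simulation. This is only a sketch under strong simplifying assumptions that are not part of the paper's algorithm: labels are plain integers (hence totally ordered), no cancellations occur (mirroring $risk = \emptyset$), and each processor hears only from its ring predecessor in every round. The names `exchange_round` and `rounds_to_converge` are hypothetical.

```python
from typing import List, Tuple


def exchange_round(local_max: List[int]) -> List[int]:
    """One synchronous round: processor i receives the local maximal label
    of its ring predecessor and adopts it if it is greater than its own."""
    n = len(local_max)
    received = [local_max[(i - 1) % n] for i in range(n)]
    return [max(own, got) for own, got in zip(local_max, received)]


def rounds_to_converge(local_max: List[int]) -> Tuple[int, int]:
    """Repeat exchange rounds until all processors name the same label.
    Returns (number of rounds, the common label)."""
    rounds = 0
    while len(set(local_max)) > 1:
        local_max = exchange_round(local_max)
        rounds += 1
    return rounds, local_max[0]


if __name__ == "__main__":
    print(rounds_to_converge([3, 1, 4, 1, 5]))
```

In this simplified setting the greatest initial label spreads one hop per round, so all processors agree on it after at most $n - 1$ rounds.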

4.2.5 Convergence 

Theorem 4.2 combines all the previous lemmas to demonstrate that, when starting from an arbitrary starting configuration, the system eventually reaches a configuration in which there is a global maximal label.

Theorem 4.2 Suppose that there exists at least one processor, $p_{unknown} \in P$, whose identity is unknown, that takes a practically infinite number of steps in $R$. Within a bounded number of steps, there is a legitimate label pair $\ell_{max}$, such that for any processor $p_i \in P$ (that takes a practically infinite number of steps in $R$), it holds that $p_i$ has $max_i[i] = \ell_{max}$. Moreover, for any processor $p_j \in P$ (that takes a practically infinite number of steps in $R$), it holds that $((max_i[j] \preceq_{lb} \ell_{max}) \wedge ((\forall \ell \in storedLabels_i[j] : legit(\ell)) \Rightarrow (\ell \preceq_{lb} \ell_{max})))$.

Proof. For any processor in the system, which may take any (bounded or practically infinite) number of steps in $R$, we know that there is a bounded number of label pairs, $\ell_{i_0}, \ell_{i_1}, \ldots$, that processor $p_i \in P$ adds to the system configuration (the else part of line 28), where $\ell_{i_k} =_{lCreator} i$ (Lemma 4.4). Thus, by the pigeonhole principle we know that, within a bounded number of steps in $R$, there is a period during which $p_{unknown}$ takes a practically infinite number of steps in $R$ whilst (all processors) $p_i$ do not add any label pair, $\ell_{i_k} =_{lCreator} i$, to the system configuration (the else part of line 28).

During this practically infinite period (with respect to $p_{unknown}$), in which no label pairs are added to the system configuration due to the else part of line 28, we know that for any processor $p_j \in P$ that takes any number of (bounded or practically infinite) steps in $R$, and any processor $p_k \in P$ that adopts labels in $R$ (line 27), $\ell_j : (\ell_j =_{lCreator} j)$, from $p_j$'s unknown domain ($\ell_j \notin storedLabels_k[j]$), it holds that $p_k$ adopts such labels (line 27) only a bounded number of times in $R$ (Lemma 4.3). Therefore, we can again follow the pigeonhole principle and say that there is a period during which $p_{unknown}$ takes a practically infinite number of steps in $R$ whilst neither $p_i$ adds a label, $\ell_{i_k} =_{lCreator} i$, to the system (the else part of line 28), nor $p_k$ adopts labels (line 27), $\ell_j : (\ell_j =_{lCreator} j)$, from $p_j$'s unknown domain ($\ell_j \notin labels_k(\ell_j)$).

We deduce that, when the above is true, we have reached a configuration in $R$ where $risk = \emptyset$ (Lemma 4.5) and remains so throughout $R$ (Lemma 4.6). Lemma 4.7 concludes by proving that, whilst $p_{unknown}$ takes a practically infinite number of steps, all processors (that take a practically infinite number of steps in $R$) name the same $\preceq_{lb}$-greatest legitimate label, which the theorem statement specifies. Thus no label $\ell =_{lCreator} j$ in $max_i[\bullet]$ or in $storedLabels_i[j]$ may satisfy $\ell \not\preceq_{lb} \ell_{max}$. ■

4.3 Increment Counter Algorithm 

In this subsection, we explain how we can enhance the labeling scheme presented in the previous subsection to obtain a practically self-stabilizing counter increment algorithm.

Counters. To achieve this task, we now need to work with practically unbounded counters. As already mentioned in Section 2, a counter $cnt$ is a triplet $\langle lbl, seqn, wid \rangle$, where $lbl$ is an epoch label as defined in the previous subsection, $seqn$ is a 64-bit integer sequence number and $wid$ is the identifier of the processor that last incremented the counter's sequence number, i.e., $wid$ is the counter writer. Then, given two counters $cnt_i, cnt_j$ we define the relation

$cnt_i \prec_{ct} cnt_j \equiv (cnt_i.lbl \prec_{lb} cnt_j.lbl) \vee ((cnt_i.lbl = cnt_j.lbl) \wedge (cnt_i.seqn < cnt_j.seqn)) \vee ((cnt_i.lbl = cnt_j.lbl) \wedge (cnt_i.seqn = cnt_j.seqn) \wedge (cnt_i.wid < cnt_j.wid))$.

Observe that when the labels of the two counters are incomparable, the counters are also incomparable.
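The counter order can be sketched as follows. This is a minimal illustration rather than the paper's code: the label order $\prec_{lb}$ is passed in as a function `label_less`, and the test order used here (proper subset on frozensets) is only a stand-in partial order chosen to exhibit incomparable labels. The names `Cnt`, `counter_less` and `proper_subset` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass(frozen=True)
class Cnt:
    lbl: Hashable  # epoch label, treated as an opaque value here
    seqn: int      # sequence number
    wid: int       # identifier of the last writer


def counter_less(a: Cnt, b: Cnt,
                 label_less: Callable[[Hashable, Hashable], bool]) -> bool:
    """True iff a precedes b: first by label, then by (seqn, wid)."""
    if label_less(a.lbl, b.lbl):
        return True
    if a.lbl != b.lbl:
        # Different labels with a.lbl not smaller: either b.lbl is smaller
        # or the labels are incomparable; in both cases a does not precede b.
        return False
    # Same label: lexicographic comparison of (seqn, wid).
    return (a.seqn, a.wid) < (b.seqn, b.wid)


def proper_subset(x: frozenset, y: frozenset) -> bool:
    """Stand-in partial order on labels (an assumption for illustration)."""
    return x < y
```

With this stand-in order, two counters whose labels are `frozenset({1})` and `frozenset({2})` are incomparable in both directions, whatever their sequence numbers are.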

Therefore, the relation $\prec_{ct}$ defines a total order (as required by practically unbounded counters) only when processors share a globally maximal label (i.e., the system runs within a "stable" epoch). As we have shown in Theorem 4.2, starting from an arbitrary configuration, we eventually reach a configuration where the active processors have adopted the same maximal label. Essentially, the counter increment algorithm enhances the labeling algorithm to take care of the counter increment once such a maximal label exists in the system.

Enhancing the labeling algorithm to handle counters. Recall that in the labeling algorithm each processor maintained two main structures of pairs of labels: the array $max[]$, which stored the local maximal labels of each other processor (based on the message exchange), and $storedLabels[]$, an array of queues of label pairs that each processor maintains in an attempt to clean up obsolete labels created by itself or other processors. These structures now need to contain counters instead of just labels and are renamed to $maxC[]$ and $storedCnts[]$ (see line 1 of Algorithm 3). Each label can yield many different counters with different $(seqn, wid)$. Therefore, in order to avoid increasing the size of these queues (with respect to the number of elements stored), we only keep the highest sequence number observed for each label (breaking ties with $wid$s). We denote a counter pair by $\langle mct, cct \rangle$, with this being the extension of a label pair $\langle ml, cl \rangle$, where $cct$ is a canceling counter for $mct$, such that either $cct.lbl \not\prec_{lb} mct.lbl$ (i.e., the counter is canceled), or $cct.lbl = \bot$.

Also, note that if there are counters in the system that are corrupt (being in the initial arbitrary configuration), then they can only force a change of label if their sequence number is exhausted (i.e., $seqn > 2^{64}$). Exhausted counters are treated by the counter algorithm in a way similar to the canceled labels in the labeling algorithm; an exhausted counter $mct$ in a counter pair $\langle mct, cct \rangle$ is canceled by setting $mct.lbl = cct.lbl$ (i.e., the counter's own label cancels it), hence making the counter non-legit (thus it cannot be used as a local maximal counter in $maxC_i[i]$). This cannot increase the number of labels that are created due to the initially corrupted ones, as shown in the correctness proof that follows.
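The exhaustion handling described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: counters are modeled as `(lbl, seqn, wid)` tuples, `None` plays the role of $\bot$, and the threshold test follows the exhaustion condition as printed. The names `CounterPair`, `SEQN_LIMIT` and `cancel_exhausted` are hypothetical, and returning a new pair instead of mutating in place is a design choice made here for clarity.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

Cnt = Tuple[str, int, int]  # (lbl, seqn, wid); labels are opaque strings here

SEQN_LIMIT = 2 ** 64  # exhaustion threshold for the sequence number


@dataclass(frozen=True)
class CounterPair:
    mct: Cnt                    # the counter itself
    cct: Optional[Cnt] = None   # canceling counter; None stands for "bottom"


def exhausted(ctp: CounterPair) -> bool:
    """A counter pair is exhausted once its seqn exceeds the threshold."""
    return ctp.mct[1] > SEQN_LIMIT


def legit(ctp: CounterPair) -> bool:
    """A counter pair is legit when nothing cancels it."""
    return ctp.cct is None


def cancel_exhausted(ctp: CounterPair) -> CounterPair:
    """Cancel an exhausted pair by letting the counter cancel itself."""
    return replace(ctp, cct=ctp.mct) if exhausted(ctp) else ctp
```

After `cancel_exhausted`, an exhausted pair is no longer legit, so it cannot be selected as a local maximal counter.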

Another issue worth mentioning is that the system is allowed to revert to a previous legit label $x$ in case the current maximal label $y$ becomes canceled. Label $x$ might have been used before to create counters, so it is required to store the last sequence number written. If $x$ is legit, the system should not propose a new label and should instead revert to $x$. Otherwise, the queues might grow without bound. We enable reverting to such an $x$ by imposing that each processor stores only a single instance of counters with the same label inside $storedCnts[]$, namely the one with the maximal $(seqn, wid)$. This is done by storing the highest value of a counter that we hear about: in line 19 upon a successful quorum write of a new sequence value, upon receipt of any write request (line 31), and on every receipt of a counter through receive() by the definition of process(); that is, on every possible appearance of a counter in the local state of a processor.

Quorums. We define a quorum set $\mathbb{Q}$ based on processors in $P$ as a set of processor subsets of $P$ (named quorums) that ensure a non-empty intersection of every pair of quorums. Namely, for all quorum pairs $Q_i, Q_j \in \mathbb{Q}$ such that $Q_i, Q_j \subseteq P$, it must hold that $Q_i \cap Q_j \neq \emptyset$. This intersection property is useful for propagating information among servers, exploiting the common intersection without having to write a value $v$ to all the servers in a system, but only to a single quorum, say $Q$. If one wants to retrieve this value, then a call to any of the quorums (not necessarily $Q$) is expected to return $v$, because there is at least one processor in every quorum that also belongs to $Q$. In the counter algorithm we exploit the intersection property to retrieve the currently greatest counter in the system, increment it, and write it back to the system, i.e., to a quorum therein. Note that majorities form a special case of a quorum system.
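The intersection property, and the remark that majorities form a quorum system, can be checked mechanically on small instances. The following sketch assumes processors are identified by integers; `is_quorum_system` and `majorities` are hypothetical helper names, not part of the algorithm.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List


def is_quorum_system(quorums: Iterable[FrozenSet[int]]) -> bool:
    """True iff every quorum is non-empty and every pair of quorums
    has a non-empty intersection."""
    qs = list(quorums)
    if any(len(q) == 0 for q in qs):
        return False
    return all(q1 & q2 for q1, q2 in combinations(qs, 2))


def majorities(processors: Iterable[int]) -> List[FrozenSet[int]]:
    """All subsets containing more than half of the processors."""
    procs = sorted(processors)
    smallest = len(procs) // 2 + 1
    return [frozenset(c)
            for size in range(smallest, len(procs) + 1)
            for c in combinations(procs, size)]


if __name__ == "__main__":
    print(is_quorum_system(majorities(range(5))))
```

Two majorities of the same set always share a processor because their sizes sum to more than $|P|$, which is what the check confirms on this small instance.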

Description of the Counter Algorithm. Pseudocode for the counter increment algorithm appears in Algorithm 3. The algorithm shows periodic counter operations (lines 12-14), extending those of the labeling algorithm, and the counter increment operations (lines 15-31). The algorithm uses the enhanced counter structures $maxC[n]$ and $storedCnts[n]$, which are maintained in the same way as in the labeling algorithm with some additional operations. We define the operator enqueue($ctp$) (line 3) to add a counter pair $ctp$ to a queue of these structures if a corresponding counter with the same $lbl$ does not exist, or to keep only one of the two instances if it exists. There are two enqueuing rules: (1) if at least one of the two counters is canceled, we keep a canceled instance, and (2) if both counters are legitimate, we keep the greatest counter with respect to $(seqn, wid)$. The counter is placed at the front of the queue.
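The two enqueuing rules can be sketched as below. This is a simplified illustration under stated assumptions: a queue is a Python list whose index 0 is the front, counters are `(lbl, seqn, wid)` tuples, `None` stands for $\bot$, and queue-size bounds are ignored. `CounterPair` and this particular `enqueue` signature are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Cnt = Tuple[str, int, int]  # (lbl, seqn, wid)


@dataclass(frozen=True)
class CounterPair:
    mct: Cnt
    cct: Optional[Cnt] = None  # None stands for "bottom" (not canceled)


def legit(ctp: CounterPair) -> bool:
    return ctp.cct is None


def enqueue(queue: List[CounterPair], ctp: CounterPair) -> None:
    """Insert ctp at the front, keeping one instance per mct label."""
    kept = ctp
    for idx, old in enumerate(queue):
        if old.mct[0] == ctp.mct[0]:      # same label already stored
            del queue[idx]
            if legit(ctp):
                if not legit(old):
                    kept = old            # rule 1: keep a canceled instance
                elif old.mct[1:] > ctp.mct[1:]:
                    kept = old            # rule 2: keep greatest (seqn, wid)
            break
    queue.insert(0, kept)                 # retained instance goes to the front
```

When the incoming pair is itself canceled it is always the one retained, which is one way of satisfying rule (1) when both instances are canceled.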

Each processor $p_i$ uses the token-based communication to transmit to every other processor $p_j$ its own maximal counter and the one it currently holds for $p_j$ in $maxC_i[j]$ (line 12). Upon receipt of such an update from $p_j$, $p_i$ first performs canceling of any exhausted counters in $storedCnts[]$ (line 14), in $maxC[]$ (line 14) and in the received couple of counter pairs (line 14). Having catered for exhaustion, it then calls process($\langle \bullet, \bullet \rangle$) with the two received counter pairs as arguments.

The process() operator calls lines 19 to 28 of Algorithm 2, adjusted for counter structures and handling counters. Thus, mentions of either labels or label structures in the labeling algorithm now refer to counters and counter structures. When adding to the counter queues, the two enqueuing rules mentioned for enqueue() (above) hold. For ease of presentation we assume that a counter with a label created by $p_i$ in line 28 of Algorithm 2 is initiated with $seqn = 0$ and $wid = i$. A call to process() (without arguments) essentially ignores lines 19 and 20 of Algorithm 2 and executes the rest of the lines, performing bookkeeping tasks. After this call to process(), any exhausted counters from the initial arbitrary configuration are enqueued as canceled to $storedCnts[]$. Therefore, they can never be readopted in case they are proposed with a non-exhausted counter.

Algorithm 3: Counter Increment; code for $p_i$

 1  Variables: A label $lbl$ is extended to the triple $\langle lbl, seqn, wid \rangle$ called a counter, where $seqn$ is the sequence number related to $lbl$, and $wid$ is the identifier of the creator of this $seqn$. A counter pair $\langle mct, cct \rangle$ extends a label pair; $cct$ is a canceling counter for $mct$, such that $cct.lbl \not\prec_{lb} mct.lbl$ or $cct.lbl = \bot$. We rename structures $max[]$ and $storedLabels[]$ of Alg. 2 to $maxC[]$ and $storedCnts[]$, which hold counter pairs instead of label pairs.
 2  Operators: process($\langle \bullet, \bullet \rangle$) - executes lines 19 to 28 of Algorithm 2, adjusted for counter structures and handling counters. For counter pairs with the same $mct$ label, only the instance with the greatest counter w.r.t. $\prec_{ct}$ is retained. In the case where one counter is canceled, we keep the canceled one. For ease of presentation we assume that a counter with a label created by $p_i$ in line 28 of Algorithm 2 is initiated with $seqn = 0$ and $wid = i$. A call of process() (without arguments) essentially ignores lines 19 and 20 of Alg. 2.
 3  enqueue($ctp$) - places a counter pair $ctp$ at the front of a queue. If $ctp.mct.lbl$ already exists in the queue, it only maintains the instance with the greatest counter w.r.t. $\prec_{ct}$, placing it at the front of the queue. If one counter pair is canceled, then the canceled copy is the one retained.
 4  Notation: Let $y$ and $y'$ be two records that include the field $x$. We denote $y =_x y' \equiv (y.x = y'.x)$.
 5  Macros:
 6    exhausted($ctp$) = ($ctp.mct.seqn > 2^{64}$)
 7    legit($ctp$) = ($ctp.cct = \bot$)
 8    retCntrQ($ct$) : return ($storedCnts[ct.lbl.lCreator]$)
 9    legitCounters() = $\{maxC[j].mct : \exists p_j \in P \wedge legit(maxC[j])\}$
10    cancelExhausted($ctp$) : $ctp.cct \leftarrow ctp.mct$
11    cancelExhaustedMaxC() : foreach $p_j \in P$, $c \in maxC[j]$ : exhausted($c$) do cancelExhausted($maxC[j]$)
      getMaxSeq() : return $max_{wid}(\{max_{seqn}(\{ctp : ctp.mct \in legitCounters() \wedge maxC[i] =_{mct.lbl} ctp\})\})$

    // Lines 12 to 14 run in the background.
12  upon transmitReady($p_j \in P \setminus \{p_i\}$) do transmit($\langle maxC[i], maxC[j] \rangle$);
13  upon receive($\langle sentMax, lastSent \rangle$) from $p_j$ begin
14      foreach $p_j \in P$, $ctp \in storedCnts[j]$ : legit($ctp$) $\wedge$ exhausted($ctp$) do cancelExhausted($ctp$);
        if ($\exists ctp' \in \langle sentMax, lastSent \rangle$ : exhausted($ctp'$)) then cancelExhausted($ctp'$);
        cancelExhaustedMaxC(); process($\langle sentMax, lastSent \rangle$);

15  procedure incrementCounter() begin
16      quorumRead();
17      repeat findMaxCounter(); until legit($maxC[i]$) $\wedge$ $\neg$exhausted($maxC[i]$);
        let $newCntr = \langle maxC[i].mct.lbl, maxC[i].mct.seqn + 1, i \rangle$;
18      if quorumWrite($newCntr$) then
19          $maxC[i] \leftarrow newCntr$; retCntrQ($maxC[i].mct$).enqueue($maxC[i]$);

20  procedure quorumRead() begin
21      foreach $p_j \in P$ do send quorumMaxRead();
        while waiting for responses from a quorum do
22          upon receipt of $max_j$ from $p_j$ do $maxC[j] \leftarrow max_j$;

23  upon request for quorumMaxRead() from $p_j$ do {findMaxCounter(); send $maxC_i[i]$ to $p_j$;}

24  procedure findMaxCounter() begin
25      cancelExhaustedMaxC(); process();
26      $maxC[i] \leftarrow$ getMaxSeq();

27  procedure quorumWrite($maxC_i[i]$) begin
28      foreach $p_j \in P$ do send quorumMaxWrite($maxC_i[i]$);
        wait for ACK from a quorum

29  upon request for quorumMaxWrite($max_j$) from $p_j$ begin
30      $maxC_i[j] \leftarrow max_{\prec_{ct}}(max_j, maxC_i[j])$;
31      if $max_j =_{lbl.lCreator} i$ then $storedCnts_i[i]$.enqueue($maxC_i[i]$);
        if exhausted($maxC_i[j]$) then cancelExhausted($maxC_i[j]$);
        send ACK to $p_j$;

The increment counter algorithm executed in lines 15 to 19 follows the logic of a writer in a MWMR register emulation. Processor $p_i$ queries the system for the counter that each processor believes to be the greatest (line 16) by calling procedure quorumRead() (lines 20-22). The responses contain the counter ($max_j$) that the responding processor $p_j$ regards as the greatest (line 23). $p_i$ aggregates the responses in its $maxC[]$ array. Note that there can be background counter diffusion as well. The quorumRead() procedure returns only when all the processors of one of the quorums have sent their responses (excluding responses from diffusion).

When quorumRead() completes, the findMaxCounter() procedure is called repeatedly until a counter that is not canceled or exhausted is found; all counters that are exhausted must eventually become canceled. The function findMaxCounter() cancels any exhausted counters in $maxC[]$ (while it holds the input from the quorum), and then calls process() (line 25) to perform bookkeeping based on the new information and to provide a valid label. When the system is stabilized, this label should not change. Any corrupt exhausted counter that might not have been canceled in $storedCnts[]$ will, through the new call to process(), become canceled, making $p_i$ immune from adopting it if it is proposed by other processors as valid. The getMaxSeq() macro returns the maximal (per $\prec_{ct}$) legit, non-exhausted counter it finds locally inside $maxC_i[]$. On exiting the loop (line 17), the counter in $maxC_i[i]$ is the greatest of the counters returned by the quorum and any other processor (through diffusion), or, in case such a counter was not found, it is a newly created counter. As already stated, such a counter is initiated to $seqn = 0$ and $wid = i$.
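The selection performed by getMaxSeq() can be sketched as follows. This is a hedged illustration: `max_c` is a plain list standing in for $maxC_i[]$, counters are `(lbl, seqn, wid)` tuples, `None` stands for $\bot$, and exhausted entries are assumed to have been canceled beforehand (as findMaxCounter() does), so filtering on legitimacy suffices. Returning `None` when no candidate exists is an assumption made here, not something the pseudocode specifies.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Cnt = Tuple[str, int, int]  # (lbl, seqn, wid)


@dataclass(frozen=True)
class CounterPair:
    mct: Cnt
    cct: Optional[Cnt] = None  # None stands for "bottom" (not canceled)


def legit(ctp: CounterPair) -> bool:
    return ctp.cct is None


def get_max_seq(max_c: List[CounterPair], i: int) -> Optional[CounterPair]:
    """Among legit entries that share the label of max_c[i], return the
    one with the greatest seqn, breaking ties by the greatest wid."""
    lbl = max_c[i].mct[0]
    candidates = [p for p in max_c if legit(p) and p.mct[0] == lbl]
    return max(candidates, key=lambda p: p.mct[1:], default=None)
```

Entries with a different label, or entries that have been canceled, never win the selection, regardless of how large their sequence numbers are.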

Following this, a local copy of $maxC_i[i]$ is incremented, i.e., the sequence number is increased by one, and $wid$ is set to the identifier of $p_i$ (line 17). The processor then attempts a write to the system (line 18), expecting responses from a quorum to return (line 27). Every processor $p_j$ receiving $p_i$'s quorum write request places it in $maxC_j[i]$ if it is greater than the value it already has in $maxC_j[i]$, and cancels it if it is exhausted. If the write fails for any reason to gather acknowledgments, the value does not get written to the local state, as it does not satisfy the if condition of line 18.
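The read-increment-write pattern can be illustrated with a small single-process simulation. This sketch makes several simplifying assumptions that go beyond the text: there is one fixed global label (so only `(seqn, wid)` is tracked), quorums are random majorities, there are no failures, exhaustion or concurrency, and every write succeeds. The function name `increment_counter` and the list-of-tuples replica state are hypothetical.

```python
import random
from typing import List, Tuple

SeqWid = Tuple[int, int]  # (seqn, wid) under one fixed, shared label


def increment_counter(replicas: List[SeqWid], writer_id: int,
                      rng: random.Random) -> SeqWid:
    """Read the greatest counter from a majority, increment it, and write
    the result back to a (possibly different) majority."""
    n = len(replicas)
    majority = n // 2 + 1
    # Quorum read: collect the counters held by a random majority.
    read_quorum = rng.sample(range(n), majority)
    seqn, _ = max(replicas[r] for r in read_quorum)
    new_counter = (seqn + 1, writer_id)
    # Quorum write: each contacted replica keeps the greater counter.
    write_quorum = rng.sample(range(n), majority)
    for r in write_quorum:
        replicas[r] = max(replicas[r], new_counter)
    return new_counter


if __name__ == "__main__":
    rng = random.Random(7)
    state = [(0, 0)] * 5
    print([increment_counter(state, w, rng) for w in (1, 2, 3)])
```

Because any read majority intersects the previous write majority, each sequential call observes the latest written counter, so the returned counters strictly increase even though no single replica is contacted every time.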

Proof of correctness. We now prove the correctness of the counter algorithm. Initially we prove that, starting from an arbitrary configuration, the system eventually reaches a global maximal label (as given in Theorem 4.2), even in the presence of exhausted counters. We then continue to show that, given such a global maximal label, the related counters are guaranteed to increment monotonically.
Lemma 4.3 Consider processors $p_i$ taking a practically infinite number of steps and a setting as described by Theorem 4.2, adjusted for counters rather than labels as described above. Algorithm 3 guarantees that, within a bounded number of steps, every processor $p_i$ holds a counter $ct$ in $maxC_i[i]$ that has $ct.lbl = \ell_{max}$, the globally maximal label, and $\ell_{max}$ is not exhausted. Moreover, $\ell_{max}$ is the greatest of all legitimate counter pair labels in $maxC_i[]$ and $storedCnts_i[]$.

Proof. The proof follows the flow of the labeling algorithm proof, and provides minor amendments wherever the use of counters (instead of labels) challenges the correctness of the arguments. We show how the counter operations ensure that the globally maximal label $\ell_{max}$ is eventually adopted by all the processors that take a practically infinite number of steps in execution $R$. We only require that $lbl = \ell_{max}$, while $seqn$ and $wid$ may differ.

Key observation. Upon a receive event (lines 13-14) of the increment counter algorithm, line 14 cancels any exhausted counter pairs appearing as legitimate in $storedCnts[]$, in $maxC[]$ and among the two received counter pairs, by setting their $mct$ as their $cct$. Increment counter procedures also have incoming counters. We note that any exhausted non-canceled counters stored in $maxC_i[]$ by a quorumRead() are canceled by the immediate call of cancelExhaustedMaxC() in line 25 (through the call to findMaxCounter() of line 17). Incoming counters through quorumWrite() are also immediately checked for exhaustion on line 31.

In line with Lemma 4.1 we require that a full execution of a receive event has taken place, i.e., all of lines 13 to 14 have been executed at least once. We now prove that all lemmas up to Lemma 4.4 in the labeling scheme's correctness proof remain unaltered if we extend labels to counters and assume that the arbitrary state contains exhausted counters. The case of adopting an exhausted label which is then canceled is an additional case in the body of the proof of Lemma 4.4, since all the other assumptions remain the same. Consider some processor $p_i \in P$ taking an infinite number of steps in execution $R$ and assigning the label $\ell_x$ of an exhausted counter $ct_x$ as $maxC_i[i]$. This implies that $\ell_x$ was not canceled when line 27 of Algorithm 2 was executed. By our key observation, any counter in the local state is checked for exhaustion and canceled immediately. By the assumption that at least one iteration of receive has taken place, we deduce that $\ell_x$ was adopted while canceled, contradicting the conditions of line 27 of Algorithm 2 and the labeling algorithm proof. Thus, after a single iteration of receive it is impossible to adopt an exhausted label.

Exhausted counters therefore cannot increase adoptions, and they pose no requirement for increasing the counter queue size, since we only keep a single instance of this canceled object. We note that once the canceling operations on exhausted counters take place, the call to process() ensures that the canceled copies of these counters are retained in $storedCnts[]$. Any new occurrences of these counter labels in $maxC[]$ are canceled by the corresponding canceled copies in $storedCnts[]$. From the arguments for label pair diffusion, which are identical for the counter pairs being diffused, any processor holding a counter $ct_x$ as its local maximal counter that is exhausted in the local state of some other active processor $p_j$ eventually stops using $ct_x$ in favor of a counter with a different non-exhausted label. Following the results of the labeling algorithm, we deduce that our cancellation policy on the exhausted counters enables Theorem 4.2 to also include the use of counters without any need to locally keep more counters than there are labels. By this theorem, we deduce that, eventually, any processor taking a practically infinite number of steps in $R$ will have a counter with the globally maximal label $\ell_{max}$. ■

Theorem 4.4 Given an execution R of the counter increment algorithm in 
which at least a majority of processors take a practically infinite number of 
steps, the algorithm ensures that counters eventually increment monotonically. 

Proof. Given a suffix $R'$ of the execution $R$ in which Lemma 4.3 holds throughout, we define $ct_{max}$ to be the counter with the globally maximal label that is the greatest in the system with respect to $(seqn, wid)$. There are two cases:

Case 1: $ct_{max}$ is the result of a call to the incrementCounter() procedure. Since this procedure only returned after quorumWrite($ct_{max}$) took place (line 28), a quorum acknowledged the writing of this value. By the intersection property of the quorums, this counter was made known to at least one processor of every quorum. If there are concurrent writes of counters with the same $seqn$, then the one with the greatest $wid$ ensures monotonicity. Any subsequent call to incrementCounter(), and thus to quorumRead(), will, again by the intersection property of the quorums, return at least one instance of $ct_{max}$, since there is at least one processor in every quorum that acknowledged this counter.

Case 2: $ct_{max}$ comes from the arbitrary state. By Lemmas 4.5, 4.6 and 4.7, the risk of having a label that remains hidden and that can cause a cancellation eventually becomes zero. We have previously used this proof to establish that all exhausted counters eventually become canceled or are eliminated from the system. In the same vein we treat the case where $ct_{max}$ is a remote counter that was not written to a quorum but may be revealed at some point to the system. Note that such a counter has the global maximal label and can indeed be adopted as a highest counter, since the adoption of this counter does not violate the monotonicity of counters, even if we go from one sequence number to a much greater one.

We also note, that this counter may have a sequence number near exhaustion. 
By the arguments of Case 1, the increments after this counter is adopted are 
monotonic and this will cause exhaustion of the counter requiring a label change 
in a number of increment steps that is not practically infinite. We have to 
mention here that this event does not increase the number of label creations, as 
the number of such counters that can cause eventual cancellation by exhaustion 
(after not practically infinite counter increments) is accounted for in the number 
of labels that can exist in the initial arbitrary state. The proof follows from our 
treatment of exhausted counters of Lemma 4.3. 

Recall that our algorithm allows processor $p_i$ to readopt a counter $cnt_i$ with
$p_i$’s own label that has a different label creator than the one it used in the
previous iteration of the labeling algorithm. Readoptions are only possible when
$cnt_i$ has not been canceled. In the case of such a readoption it is implied that
$cnt_i$ was dropped in favor of a counter $cnt'$ with a higher lCreator identifier
that was eventually canceled. This implies that $cnt'$ must come from the initial
arbitrary configuration. Hence these “breaks” in monotonicity can only occur
a bounded number of times in the execution, since counters such as $cnt'$ are
bounded in number and are handled by the Labeling algorithm.

Our algorithm stores every incoming counter with a label that was created
by $p_i$ in the $storedCnts_i[i]$ queue, keeping the instance with the greatest
(seqn, wid) (see lines 14, 19 and 31). So if $p_i$ is to backstep to $cnt_i$, then
the greatest instance that $p_i$ has learned about $cnt_i$ is adopted from
$storedCnts_i[i]$. The only way for a new value of $cnt_i$ to be missed by $p_i$ is
for $p_i$ not to hear of a quorum read incrementing $cnt_i$ before $cnt'$ was
adopted. Again, as explained above, this is attributed to the bounded number of
remnant counters from the arbitrary configuration that are dealt with by the
Labeling and Counter algorithms, as Lemma 4.3 describes.
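
The bookkeeping described above can be sketched as follows. This is a simplified illustration, not the paper's code: it assumes each stored counter is identified by an opaque label key and ordered by the pair (seqn, wid), and the name store_counter is hypothetical.

```python
def store_counter(stored, label, seqn, wid):
    """Keep, per label, only the instance with the greatest (seqn, wid).

    `stored` maps a label to the greatest (seqn, wid) pair seen so far.
    Python compares tuples lexicographically, so seqn dominates and wid
    breaks ties between counters with the same seqn.
    """
    current = stored.get(label)
    if current is None or (seqn, wid) > current:
        stored[label] = (seqn, wid)
    return stored[label]
```

When a processor backsteps to an earlier label, the value returned for that label is the greatest instance it has heard of, so a readoption resumes from that instance rather than from a stale one.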

Now, under a legal execution where Lemma 4.3 holds, Case 2 can only occur 
a bounded number of times (since the counters in the initial arbitrary state are 
bounded in number). Furthermore, Case 1 is eventually true for the rest of the 
execution. In any case, the increment of the counter is monotonic with respect
to $\prec_{ct}$ in every subsequent call to incrementCounter(). ■

Having a self-stabilizing counter increment algorithm, it is not hard to
implement a self-stabilizing MWMR register emulation. Each counter is associated
with a value and the counter increment procedure essentially becomes a write
operation: once the maximal counter is found, it is increased and associated
with the new value to be written, which is then communicated to a majority of
processors. The read operation is similar: a processor first queries all processors
about the maximum counter they are aware of. It collects responses from a
majority and, if there is no maximal counter, it returns $\bot$, so the processor
needs to attempt to read again (i.e., the system has not yet converged to a
maximal label). If a maximal counter exists, it sends this together with the
associated value to all the processors, and once it collects a majority of
responses, it returns the counter with the associated value (the second phase is a
standard requirement for preserving the consistency of the register, cf. [3, 19]).
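
The two operations can be sketched as an in-process simulation. This is a minimal sketch under strong assumptions: the labels have already converged, so a counter is reduced to the pair (seqn, wid) and a maximal counter always exists (the $\bot$ case is not modeled); message passing is replaced by direct method calls; and the names Replica, some_majority, write and read are hypothetical.

```python
import random


class Replica:
    """One processor's local copy: a counter (seqn, wid) and its associated value."""

    def __init__(self):
        self.counter = (0, 0)
        self.value = None

    def query(self):
        return self.counter, self.value

    def update(self, counter, value):
        # Adopt only strictly greater counters, so the local counter is monotonic.
        if counter > self.counter:
            self.counter, self.value = counter, value


def some_majority(replicas, rng):
    """Pick an arbitrary majority, standing in for whichever processors reply first."""
    return rng.sample(replicas, len(replicas) // 2 + 1)


def write(replicas, wid, value, rng):
    """Find the maximal counter in a majority, increment it, write it to a majority."""
    replies = [r.query() for r in some_majority(replicas, rng)]
    (seqn, _), _ = max(replies, key=lambda reply: reply[0])
    new_counter = (seqn + 1, wid)
    for r in some_majority(replicas, rng):
        r.update(new_counter, value)
    return new_counter


def read(replicas, rng):
    """Query a majority, then write the maximal pair back before returning it."""
    replies = [r.query() for r in some_majority(replicas, rng)]
    counter, value = max(replies, key=lambda reply: reply[0])
    for r in some_majority(replicas, rng):   # second phase: write back
        r.update(counter, value)
    return counter, value
```

Because any two majorities intersect, a read that follows a completed write observes that write's counter regardless of which majorities happen to be contacted.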

5 Virtually Synchronous Stabilizing Replicated 
State Machine 

We now present a self-stabilizing reliable multicast algorithm that provides fault- 
tolerance, with respect to processor crashes and communication asynchrony, by 
considering the (current group) view of the changing processor set at the end of 
the group communication endpoint. We propose a self-stabilizing algorithm that 
guarantees the VS property. Namely, any two processors that are members of 
the same view, ought to deliver identical message sets to their SMRs as long as
they continue to share the same view, which indeed may change. This way, SMR
algorithms can use the multicast service to synchronize their state transitions, 
i.e., the group members multicast their current automaton state, and the last 
received input that had led to that state. 

Overview. A key advantage of multicast services (with virtual synchrony) is 
the ability to reuse the same view during many multicast rounds, and thus every 
automaton step requires a single multicast round. The aim of the proposed 
algorithm is to demonstrate in a self-stabilizing manner the most important 
ways to cut down the number of times in which the service needs to agree on 
a new view, and when it does, to perform it swiftly. Similar to [6], we assume 
that the service works in the network’s primary partition (see Definition 5.1) 
and require that a majority of processors are present in every view set. We do 
not however require all (local) failure detectors to agree on the set of recently 
alive and connected processors. 

Multicast services that provide VS often leverage the system’s ability to
preserve (when possible) the coordinator during view transitions rather than
electing a new coordinator. The motivation here is that the coordinator has the
most recent automaton state and holds a copy of the set of unstable messages,
which are the ones that were delivered to at least one view member, but for which
the (alive and connected) view members have yet to receive a delivery
acknowledgement. Our solution naturally follows this approach since it often
helps the service to abstain from electing a leader upon every view change, as
well as to avoid view transitions that require the coordinator to first inquire
about all unstable messages (and the most recent automaton state) among all
view members that continue to the next view. This is done so that the service
can provide the virtual synchrony property. Thus, we consider the notion of
coordinators that a majority of processors never suspects and we show that,
in the existence of such processors, one of these coordinators will eventually be
used in all subsequent views (Definition 5.1). As explained in Section 2, the
algorithm uses the counter increment algorithm, as well as a reliable multicast
and a failure detector built over a self-stabilizing FIFO data link.

Definition 5.1 We say that the output of the (local) failure detectors in
execution $R$ includes a primary partition when it includes a supporting majority
of processors $P_{maj} : P_{maj} \subseteq P$, that (mutually) never suspect at
least one processor, i.e., $\exists p_\ell \in P$ for which
$|P_{maj}| > \lfloor n/2 \rfloor$ and
$(p_i \in (P_{maj} \cap FD_\ell)) \iff (p_\ell \in (P_{maj} \cap FD_i))$
in every $c \in R$, where $FD_x$ returns the set of processors
that according to $p_x$’s failure detector are active.
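
One way to read Definition 5.1 is sketched below for a single configuration. The sketch assumes that the failure detector outputs are given as a mapping from each processor to the set it considers active (including itself), and it takes the supporting majority of a candidate to be the processors that mutually do not suspect it. The definition itself requires the property in every configuration of $R$, which a single snapshot cannot establish, and the helper names are hypothetical.

```python
def mutual_supporters(fd, leader):
    """Processors p_i such that leader is in FD_i and p_i is in FD_leader."""
    return {i for i, active in fd.items()
            if leader in active and i in fd[leader]}


def has_primary_partition(fd, n):
    """True when some processor is mutually unsuspected by more than n/2 processors."""
    return any(len(mutual_supporters(fd, leader)) > n // 2 for leader in fd)
```

For example, with five processors where processors 0, 1 and 2 all consider each other active, processor 0 has the supporting majority {0, 1, 2} even though processors 3 and 4 suspect it.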


5.1 Detailed Description of Algorithm 4 

The existence of coordinator $p_\ell$ is at the heart of Algorithm 4. Processors
that belong to and accept $p_\ell$’s view proposal are called the followers of
$p_\ell$. The algorithm determines the availability of a coordinator and acts
towards the election of a new one when no valid one exists (lines 5 to 9). The
pseudocode details the coordinator-side (lines 10 to 14) and the follower-side
(lines 15 to 19) actions. At the end of each iteration the algorithm defines how
$p_\ell$ and its followers exchange messages (lines 21 to 24).

Algorithm 4: A self-stabilizing automaton replication using virtual synchrony,
code for processor $p_i$

1 Constants: PCE (periodic consistency enforcement) number of rounds between global
  state check;

2 Interfaces: fetch() next multicast message, apply(state, msg) applies the step msg to
  state (while producing side effects), synchState(replica) returns a replica consolidated
  state, synchMsgs(replica) returns a consolidated array of last delivered messages,
  failureDetector() returns a vector of processor pairs (pid, crdID), inc() returns a
  counter from the increment counter algorithm;

3 Variables: rep[n] = (view = (ID, set), $status \in \{Multicast, Propose, Install\}$,
  (multicast round number) rnd, (replica) state, (last delivered messages) msg[n] (to the
  state machine), (last fetched) input (to the state machine), propV = (ID, set), (no
  coordinator alive) noCrd, (recently live and connected component) FD) : an array of
  state replicas of the state machine, where rep[i] refers to the one that processor $p_i$
  maintains. A local variable FDin stores the failureDetector() output. FD is an alias for
  {FDin.pid}, i.e. the set of processors that the failure detector considers as active. Let
  crd(j) = {FDin.crdID : FDin.pid = j}, i.e. the id of $p_j$’s local coordinator, or
  $\bot$ if none.

4 Do forever begin

5   let FDin = failureDetector();

6   let $seemCrd = \{p_\ell = rep[\ell].propV.ID.wid \in FD :
    (|rep[\ell].propV.set| > \lfloor n/2 \rfloor) \land
    (|rep[\ell].FD| > \lfloor n/2 \rfloor) \land
    (p_\ell \in rep[\ell].propV.set) \land
    (p_k \in rep[\ell].propV.set \leftrightarrow p_\ell \in rep[k].FD) \land
    ((rep[\ell].status = Multicast) \rightarrow (rep[\ell].(view = propV) \land crd(\ell) = \ell)) \land
    ((rep[\ell].status = Install) \rightarrow crd(\ell) = \ell)\}$;

7   let $valCrd = \{p_\ell \in seemCrd : (\forall p_k \in seemCrd :
    rep[k].propV.ID \preceq_{ct} rep[\ell].propV.ID)\}$;

8   $noCrd \leftarrow (|valCrd| \neq 1)$; $crdID \leftarrow valCrd$;

9   if $(|FD| > \lfloor n/2 \rfloor) \land (((|valCrd| \neq 1) \land
    (|\{p_k \in FD : p_i \in rep[k].FD \land rep[k].noCrd\}| > \lfloor n/2 \rfloor)) \lor
    ((valCrd = \{p_i\}) \land (FD \neq propV.set) \land
    (|\{p_k \in FD : rep[k].propV = propV\}| > \lfloor n/2 \rfloor)))$
    then $(status, propV) \leftarrow (Propose, (inc(), FD))$

10  else if $(valCrd = \{p_i\}) \land ((\forall p_j \in view.set :
    rep[j].(view, status, rnd) = (view, status, rnd)) \lor
    ((status \neq Multicast) \land (\forall p_j \in propV.set :
    rep[j].(propV, status) = (propV, Propose))))$ then

11    if status = Multicast then

12      apply(state, msg); $input \leftarrow fetch()$;

13      foreach $p_j \in P$ do if $p_j \in view.set$ then $msg[j] \leftarrow rep[j].input$
        else $msg[j] \leftarrow \bot$
        $rnd \leftarrow rnd + 1$;

14    else if status = Propose then
        $(state, status, msg) \leftarrow (synchState(rep), Install, synchMsgs(rep))$
      else if status = Install then $(view, status, rnd) \leftarrow (propV, Multicast, 0)$

15  else if $valCrd = \{p_\ell\} \land \ell \neq i \land (rep[\ell].rnd = 0 \lor
    rnd < rep[\ell].rnd \lor rep[\ell].(view \neq propV))$ then

16    if $rep[\ell].status = Multicast$ then

17      if $rep[\ell].state = \bot$ then $rep[\ell].state \leftarrow state$ /* PCE optimization, line 21 */
        $rep[i] \leftarrow rep[\ell]$; apply(state, $rep[\ell].msg$); /* for the sake of side-effects */

18      $input \leftarrow fetch()$;

19    else if $rep[\ell].status = Install$ then $rep[i] \leftarrow rep[\ell]$
      else if $rep[\ell].status = Propose$ then $(status, propV) \leftarrow rep[\ell].(status, propV)$

20  let m = rep[i] /* sending messages: all to coordinator and coordinator to all */;

21  if $status = Multicast \land rnd \pmod{PCE} \neq 0$ then $m.state \leftarrow \bot$
    /* PCE optimization, line 17 */

22  let $sendSet = (seemCrd \cup \{p_k \in propV.set : valCrd = \{p_i\}\} \cup
    \{p_k \in FD : noCrd \lor (status = Propose)\})$

23  foreach $p_j \in sendSet$ do send(m)

24 Upon message arrival m from $p_j$ do $rep[j] \leftarrow m$;

The processor state and interfaces. The state of each processor includes
its current view, and $status \in \{Propose, Install, Multicast\}$, which refers
to usual message multicast operation when in Multicast, or view establishment
rounds in which the coordinator can Propose a new view and proceed to Install it
once all preparations are done (line 3). During multicast rounds, rnd denotes the
round number, state stores the replica, and msg[n] is an array that includes the
last delivered messages to the state machine, which is the input fetched by each
group member and then aggregated by the coordinator during the previous
multicast round. During multicast rounds, it holds that propV = view. However,
whenever $propV \neq view$ we consider propV as the newly proposed view and
view as the last installed one. Each processor also uses noCrd and FD to
indicate whether it is aware of the absence of a recently active and connected valid
coordinator, and respectively, of the set of processors present in the connected
component, as indicated by its local failure detector. The processors exchange
their state via message passing and store the arriving messages in the replica’s
array, rep[n] (line 24), where rep[i].(view, ..., noCrd) is an alias to the
aforementioned variables and rep[j] refers to the last arriving message from
processor $p_j$ containing $p_j$’s rep[j]. Our presentation also uses subscript
$k$ to refer to the content of a variable at processor $p_k$, e.g., $rep_k[j].view$,
when referring to the last installed view that processor $p_k$ last received from
$p_j$.

Algorithm 4 assumes access to the application’s message queue via fetch(),
which returns the next multicast message, or $\bot$ when no such message is
available (line 2). It also assumes the availability of the automaton state
transition function, apply(state, msg), which applies the aggregated input array,
msg, to the replica’s state and produces the local side effects. The algorithm also
collects the followers’ replica states and uses synchState(replica) to return the
new state. The function failureDetector() provides access to $p_i$’s failure
detector, and the function inc() (counter increment) fetches a new and unique
(view) identifier, ID, that can be totally ordered by $\preceq_{ct}$, where ID.wid
is the identity of the processor that incremented the counter, resulting in the
counter value ID (hence view IDs are counters as defined in Section 4.3). Note that
when two processors attempt to concurrently increment the counter, due to symmetry
breaking, one of the two counters is the largest. Each processor will continue to
propose a new view based on the counter written, but then (as described below)
the one with the highest counter will succeed (line 7).

Determining coordinator availability. Algorithm 4 takes an agile approach
to message multicasting with atomic delivery guarantees. Namely, a new
view is installed whenever the coordinator sees a change to its local failure
detector, failureDetector(), which $p_i$ stores in $FD_i$ (line 5). Processor
$p_i$ can see the set of processors, $seemCrd_i$, that each “seems” to be the view
coordinator, because $p_i$ stored a message from $p_\ell \in FD_i$ for which
$p_\ell = rep[\ell].propV.ID.wid$. Note that $p_i$ cannot consider $p_\ell$ as a
seeming coordinator when $p_\ell$’s proposal view does not include a majority, or
if $p_\ell$ is not a member of the view it claims to coordinate. In the case of
Multicast rounds, their view fields must match their view proposal fields
(line 6). Also, using the failure detector heartbeat exchange, processors
communicate the identifier of the processor they consider to be their coordinator,
or $\bot$ if none. As shown in the correctness proof, this helps to detect initially
corrupted states where a processor $p_i$ might consider processor $p_j$ to be its
coordinator, but processor $p_j$ does not consider itself to be the coordinator.

The algorithm considers a processor as the valid coordinator if it belongs
to seemCrd and has the $\preceq_{ct}$-greatest view identifier among the set of
seeming coordinators (line 7). Note that the set $valCrd_i$ either includes a
single processor, $p_\ell$, which $p_i$ considers to be a valid coordinator, or
$p_i$ does not consider any processor to be a valid coordinator that was recently
live and connected (line 8). In the latter case, $p_i$ will not propose a new view
before its (local) failure detector indicates that it is within the primary
component and that a supportive majority of recently live and connected processors
also do not observe the availability of a valid coordinator (line 9). Note that in
the case where $p_i$ is a valid coordinator, it will create and propose a new view
whenever the last proposed view does not match the set of processors that were
recently live and connected according to its (local) failure detector. In such a
case no other processor but $p_i$ may propose, because it is the only one that
retains a majority of processors that have accepted the previous view.
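
The selection of the valid coordinator among the seeming coordinators (lines 7 and 8) can be sketched as follows. This is an illustrative simplification: view identifiers are modeled as (seqn, wid) pairs compared lexicographically, which ignores the label part of the counters, and the function name valid_coordinator is hypothetical. The filtering that produces seemCrd (line 6) is not modeled.

```python
def valid_coordinator(seem_crd):
    """Pick the seeming coordinator with the greatest proposed view identifier.

    `seem_crd` maps a processor id to the identifier (seqn, wid) of its
    proposed view. Returns (val_crd, no_crd), mirroring lines 7 and 8:
    val_crd is the set of processors whose identifier is not exceeded by
    any other, and no_crd is True unless exactly one such processor exists.
    """
    val_crd = {proc for proc, vid in seem_crd.items()
               if all(other <= vid for other in seem_crd.values())}
    no_crd = len(val_crd) != 1
    return val_crd, no_crd
```

Since the wid component differs between proposals created by different processors, two concurrent proposals with the same seqn are still ordered, and at most one processor survives the selection.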

The coordinator-side. Processor $p_i$ is aware of its valid coordinatorship
when $(valCrd_i = \{p_i\})$ (line 10). It takes action related to its role as a
coordinator when it detects the round end, based on input from other processors.
During a normal Multicast round, $p_i$ observes the round end once for every
view member $p_j$ it holds that
$(rep_i[j].(view, status, rnd) = (view_i, status_i, rnd_i))$. For the case of
Propose and Install rounds, the algorithm does not need to consider the round
number, rnd.

Depending on its status, the coordinator $p_i$ proceeds once it observes the
successful round conclusion. At the end of a normal Multicast round, the
coordinator increments the round number after aggregating the followers’ input
(line 11). The coordinator continues from the end of a Propose round to an
Install round after using the most recently received replicas to install a
synchronized state of the emulated automaton (line 14). At the end of a successful
Install round, the coordinator proceeds to a Multicast round after installing the
proposed view and the first round number. (Note that implicitly the coordinator
creates a new view if it detects that the round number is exhausted
($rnd > 2^{64}$), or if there is another member of its view that has a greater
round number than the one this coordinator has. This can only be due to corruption
in the initial arbitrary state which affected the rnd part of the state.)

The follower-side. Processor $p_i$ is aware of its coordinator’s identity when
$(valCrd_i = \{p_\ell\})$ and $\ell \neq i$ (line 15). Being a follower, $p_i$ only
enters this block of the pseudocode when it receives a new message, i.e., the first
message round when installing a new view ($rep[\ell].rnd = 0$), the first time a
message arrives ($rnd < rep[\ell].rnd$) or a new view is proposed
($rep[\ell].(view \neq propV)$).

During normal Multicast rounds (line 16) the follower $p_i$ applies the
aggregated message of this round to its current automaton state so that it produces
the needed side-effects before adopting the coordinator’s replica (line 19). Note
that, in the case of a Propose round, the algorithm design stops $p_i$ from
overwriting its round number, thus allowing the coordinator to know what was the
last round number that it delivered during the last installed view.
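
The status transitions on both sides can be sketched as two small functions. This is only an illustration of the control flow in lines 11 to 14 and 16 to 19: replicas are modeled as dictionaries with the keys view, propV, status and rnd, and the calls to apply(), fetch(), synchState() and synchMsgs() are omitted. The function names are hypothetical.

```python
def coordinator_round_end(rep_i):
    """Coordinator's transition once it observes the end of a round."""
    r = dict(rep_i)
    if r["status"] == "Multicast":
        r["rnd"] += 1                       # move to the next multicast round
    elif r["status"] == "Propose":
        r["status"] = "Install"             # all members echoed the proposal
    elif r["status"] == "Install":
        r["view"], r["status"], r["rnd"] = r["propV"], "Multicast", 0
    return r


def follower_adopt(rep_i, rep_crd):
    """Follower's update from the coordinator's replica rep_crd."""
    if rep_crd["status"] in ("Multicast", "Install"):
        return dict(rep_crd)                # adopt the coordinator's replica
    # Propose: copy only status and propV, keeping the follower's own rnd.
    return {**rep_i, "status": rep_crd["status"], "propV": rep_crd["propV"]}
```

In the Propose case the follower keeps its own rnd, which is what lets the coordinator learn the last round each follower delivered in the previously installed view.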

Message exchange and the PCE optimization. Each processor periodically
sends its current replica (line 23) and stores the received ones (line 24).
As an optimization, we propose to avoid sending the entire replica state in
every Multicast round. Instead, we consider a predefined constant, PCE (periodic
consistency enforcement), that determines the maximum number of Multicast
rounds during which the followers do not transmit their replica state to the
coordinator and the coordinator does not send its state to them (lines 17 and 21).
Note that the greater the value of PCE, the longer it takes to recover from
transient faults. Therefore, one has to take this into consideration when extending
the approach of periodic consistency enforcement to other elements of the replica,
e.g., in view and propV, one might want to reduce the communication costs
that are associated with the set field and the epoch part of the ID field.
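
The optimization of lines 17 and 21 can be sketched as a pair of helpers. The sketch assumes replicas are dictionaries with the keys status, rnd and state, and uses None in place of $\bot$; the function names are hypothetical.

```python
def outgoing_message(replica, pce):
    """Copy of the replica to send, with the state omitted in most rounds (line 21)."""
    m = dict(replica)
    if m["status"] == "Multicast" and m["rnd"] % pce != 0:
        m["state"] = None                   # state is sent only every pce rounds
    return m


def restore_state(received, local_state):
    """Receiver side (line 17): fill an omitted state with the local one."""
    if received["state"] is None:
        return {**received, "state": local_state}
    return received
```

A corrupted local state can therefore persist for up to PCE multicast rounds before a full state transfer overwrites it, which is the recovery-time cost noted above.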

5.2 Correctness Proof of Algorithm 4 

The correctness proof shows that, starting from an arbitrary state in an execution
$R$ of Algorithm 4 and once the primary partition property (Definition 5.1) holds
throughout $R$, we reach a configuration $c \in R$ in which some processor $p_\ell$
with a supporting majority will propose a view including its supporting majority.
This view is either accepted by all its member processors or, in the case where
$p_\ell$ experiences a failure detection change, $p_\ell$ can repropose a view. We
conclude by proving that any execution suffix of $R$ that begins from such a
configuration $c$ will preserve the virtual synchrony property and implement state
machine replication. We begin with some definitions.

Once the system considers processor $p_\ell$ as the view coordinator
(Definition 5.1), its supporting majority can extend the support throughout $R$ and
thus $p_\ell$ continues to emulate the automaton with them. Furthermore, there is
no clear guarantee for a view coordinator to continue to coordinate for an
unbounded period when it does not meet the criteria of Definition 5.1 throughout
$R$. Therefore, for the sake of presentation simplicity, the proof considers any
execution $R$ with only definitive suspicions, i.e., once processor $p_\ell$
suspects processor $p_j$, it does not stop suspecting $p_j$ throughout $R$. The
correctness proof implies that eventually, once all of $R$’s suspicions appear in
the respective local failure detectors, the system elects a coordinator that has a
supporting majority throughout $R$.

Consider a configuration $c$ in an execution $R$ of Algorithm 4 and a processor
$p_i \in P$. We define the local (view) coordinator of $p_i$, say $p_j$, to be the
only processor that, based on $p_i$’s local information, has a proposed view
satisfying the conditions of lines 6 and 7 such that $valCrd = \{p_j\}$. $p_j$ is
also considered the global (view) coordinator if for all $p_k$ in $p_j$’s proposed
view ($propV_j$), it holds that $valCrd_k = \{p_j\}$. When $p_i$ has a (local)
coordinator then $p_i$’s local variable noCrd = False, whilst when it has no local
coordinator, noCrd = True. Moving to the proof, we consider the following useful
remark on Definition 5.1.

Remark 5.1 Definition 5.1 suggests that we can have more than one processor
that has a supporting majority. In this case, it is not necessary to have the same
supporting majority for all such processors. Thus for two such processors
$p_i, p_j$ with respective supporting majorities $P_{maj}(i)$ and $P_{maj}(j)$ we
do not require that $P_{maj}(i) = P_{maj}(j)$, but
$P_{maj}(i) \cap P_{maj}(j) \neq \emptyset$ trivially holds.
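
The non-empty intersection claimed in Remark 5.1 follows from a counting argument, spelled out here for completeness. It uses only the facts that each supporting majority has more than $\lfloor n/2 \rfloor$ members and that both are subsets of the $n$ processors in $P$.

```latex
\begin{align*}
|P_{maj}(i) \cap P_{maj}(j)|
  &= |P_{maj}(i)| + |P_{maj}(j)| - |P_{maj}(i) \cup P_{maj}(j)| \\
  &\geq 2\left(\lfloor n/2 \rfloor + 1\right) - n \\
  &\geq 1 .
\end{align*}
```

The last step holds because $2\lfloor n/2 \rfloor + 2 - n$ equals 2 when $n$ is even and 1 when $n$ is odd.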

Lemma 5.1 Let $R$ be an execution of Algorithm 4 with an arbitrary initial
configuration such that Definition 5.1 holds. Consider a processor
$p_i \in P_{maj}$ which has a local coordinator $p_k$, such that $p_k$ is either
inactive or it does not have a supporting majority throughout $R$. There is a
configuration $c \in R$, after which $p_i$ does not consider $p_k$ to be its local
coordinator.

Proof. There are two possibilities regarding processor $p_k$.

Case 1: We first consider the case where $p_k$ is inactive throughout $R$. By
the design of our failure detector, $p_i$ is informed of $p_k$’s inactivity such
that line 5 will return an $FD_i$ to $p_i$ where $p_k \notin FD_i$. The threshold
we set for our failure detector (see Section 2) determines how soon $p_k$ is
suspected. By the first condition of line 6 we have that
$p_k \notin FD_i \Rightarrow p_k \notin seemCrd_i \Rightarrow p_k \notin valCrd_i$,
i.e., $p_i$ stops considering $p_k$ as its local coordinator. By definitive
suspicions, $p_i$ does not stop suspecting $p_k$ throughout $R$.

We now turn to the case where $p_k$ is active, however it does not have a
supporting majority throughout $R$, but $p_i$ still considers $p_k$ as its local
coordinator, i.e. $valCrd_i = \{p_k\}$. Two subcases exist:

Case 2(a): $p_k$ considers itself to have a supporting majority, and
$p_i \in propV_k$. Note that the latter assumption implies that $p_k$ is forced by
lines 20-23 to propagate $rep_k[k]$ to $p_i$ in every iteration. By the failure
detector, there exists an iteration where $p_k$ will have
$|FD_k| = \lfloor n/2 \rfloor + 1$ and is informed that some $p_j \in propV_k$ has
$p_k \notin FD_j$, and so the condition of line 6 ($|FD| > \lfloor n/2 \rfloor$)
fails for $p_k$, which stops being the coordinator of itself. If $p_k$ does not
find a new coordinator, hence $noCrd_k$ = True, then $p_k$ propagates its
$rep_k[k]$ to $p_i$. But this implies that $p_i$ receives $rep_k[k]$ and stores it
in $rep_i[k]$. Upon the next iteration after this reception, $p_i$ will remove
$p_k$ from its seemCrd set because $p_k$ does not satisfy the condition
$|rep_i[k].FD| > \lfloor n/2 \rfloor$ of line 6. We conclude that $p_i$ stops
considering $p_k$ as its local coordinator if $p_k$ does not find a new
coordinator. Nevertheless, $p_k$ may find a new coordinator before propagating
$rep_k[k]$. If $p_k$ has a coordinator other than itself, then it only propagates
$rep_k[k]$ to its coordinator and thus $p_i$ does not receive this information. We
thus refer to the next case:

Case 2(b): $p_k$ has a different local coordinator than itself. This can occur
either as described in Case 2(a) or as a result of an arbitrary initial state in
which $p_i$ believes that $p_k$ is its local coordinator but $p_k$ has a different
local coordinator. We note that the difficulty of this case is that $p_k$ only
sends $rep_k[k]$ to its coordinator, and thus the proof of Case 2(a) is not useful
here. As explained in Algorithm 4, the failure detector returns a set with the
identities (pid) of all the processors it regards as active, as well as the
identity of the local coordinator of each of these processors. As per the
algorithm’s notation, the coordinator of processor $p_k$ is given by crd(k). Since
$p_i$’s failure detector regards $p_k$ as active, crd(k) is indeed updated
(remember that $p_i$ receives the token with $p_k$’s crd(k) infinitely often from
$p_k$), otherwise $p_k$ is removed from FD and is not a valid coordinator for
$p_i$. But $p_k$ does not consider itself as the coordinator (by the assumption of
Case 2(b)), and thus it holds that $crd(k) \neq k$. Therefore, in the first
iteration after $p_i$ receives $crd(k) \neq k$, one of the last two conditions
of line 6 fails (depending on what is the view status that $p_i$ has in
$rep_i[k]$) so $p_k \notin seemCrd_i$ and thus $valCrd_i \neq \{p_k\}$. We conclude
that any such $p_k$ stops being $p_i$’s coordinator and, by the assumption of
definitive suspicions, we reach the result. It is also important to note that
$p_k$ never again satisfies all the conditions of line 9 to create a new view. ■

We now define the notion of “propose” more rigorously to be used in the 
sequel. 

Definition 5.2 Processor $p_\ell \in P$ with status = Propose is said to propose
a view $propV_\ell$ if, in a complete iteration of Algorithm 4, $p_\ell$ either
satisfies $valCrd_\ell = \{p_\ell\}$ or satisfies all the conditions of line 9 to
create $propV_\ell$. A proposal is completed when $propV_\ell$ is propagated
through lines 20-23 to all the members of $FD_\ell$.

The above definition does not imply that $p_\ell$ will continue proposing the view
$propV_\ell$, since the replicas received from other processors may force $p_\ell$
to either exclude itself from $valCrd_\ell$ or create a new view (see Lemma 5.3).
If the view is installed, then the proposal procedure will stop, although
$propV_\ell$ will still be sent as part of the replica propagation at the end of
each iteration. Also note that the origins of such a proposed view are not defined.
Indeed it is possible for a view that was not created by $p_\ell$ but bears
$p_\ell$’s creator identity to come from an arbitrary state and be proposed, as
long as all the conditions of lines 6 and 7 are met.

Lemma 5.2 If the conditions of Definition 5.1 hold throughout an execution $R$
of Algorithm 4, then starting from an arbitrary configuration in which there is
no global coordinator, the system reaches a configuration in which at least one
processor with a supporting majority will propose a view (with “propose” defined
as in Definition 5.2).

Proof. By Definition 5.1, at least one processor with supporting majority exists.
Denote one such processor as $p_\ell$. Assume for contradiction that throughout
$R$, no processor $p_\ell$ with supporting majority proposes a view. Then $p_\ell$
either has a local coordinator (that is not global) or does not have a coordinator.

Case 1: $p_\ell$ does not have a coordinator ($noCrd_\ell$ = True). If $p_\ell$
does not propose a view (as per the “propose” Definition 5.2), this is because it
does not hold a proposal that is suitable and it does not satisfy some condition of
line 9 which would allow it to create a new view. The first condition of line 9,
$(|FD| > \lfloor n/2 \rfloor)$, is always satisfied by our assumption that $p_\ell$
is not suspected by a majority throughout $R$. In the second condition, both
(i) $((|valCrd_\ell| \neq 1) \land (|\{p_k \in FD_\ell : p_\ell \in rep_\ell[k].FD
\land rep_\ell[k].noCrd\}| > \lfloor n/2 \rfloor))$ and
(ii) $((valCrd_\ell = \{p_\ell\}) \land (FD_\ell \neq propV_\ell.set) \land
(|\{p_k \in FD_\ell : rep_\ell[k].propV = propV_\ell\}| > \lfloor n/2 \rfloor))$
must fail due to our assumption that $p_\ell$ never proposes. Indeed (ii) fails
since $noCrd_\ell = True \Rightarrow valCrd_\ell \neq \{p_\ell\}$. If the first
expression also fails, this implies that throughout $R$, $p_\ell$ does not know of
a majority of processors with noCrd = True and so it cannot propose a new view.

Let us assume that only one processor $p_j \in P_{maj}(\ell) \subseteq FD_\ell$ is
required to switch from $noCrd_j$ = False to True in order for $p_\ell$ to gain a
majority of processors without a coordinator. But if $noCrd_j$ = False then $p_j$
must already have a coordinator, say $p_k$. We have the following two subcases:

Case 1(a): $p_k$ does not have a supporting majority. Lemma 5.1 guarantees
that $p_j$ stops considering $p_k$ as its local coordinator. Thus $p_j$ eventually
goes to noCrd = True and, by the propagation of its replica, $p_\ell$ receives the
required majority to go into proposing a view. But this contradicts our initial
assumption, so we are led to the following case.

Case 1(b): $p_k$ has a supporting majority and a view proposal $propV_k$ from the
initial arbitrary configuration but is not the global coordinator. But this implies
that the lemma trivially holds, and so the following case must be true.

Case 2: $p_\ell$ has a coordinator, say $p_{k'}$. The two subcases of whether
$p_{k'}$ has a supporting majority or not are identical to the two subcases 1(a)
and 1(b) concerning $p_k$ that we studied above. Thus, it must be that either
$p_\ell$ will eventually propose a view, or that $p_{k'}$ has a proposed view, thus
contradicting our assumption and so the lemma follows. ■

Lemma 5.2 establishes that at least one processor with supporting majority 
will propose a view in the absence of a valid coordinator. We now move to prove 
that such a processor will only propose one view, unless it experiences changes 
in its FD that render the view proposal’s membership obsolete. The lemma 
also proves that any two processors with supporting majority will not create 
views in order to compete for the coordinatorship. 

Lemma 5.3 If the conditions of Definition 5.1 hold throughout an execution
$R$ of Algorithm 4, then starting from an arbitrary configuration, the system
reaches a configuration in which any processor $p_\ell$ with a supporting majority
proposes a view $propV_\ell$, and cannot create a new proposed view in $R$ unless
$FD_\ell \neq propV_\ell.set$ and a majority of processors has adopted
$propV_\ell$. As a consequence, the system reaches a configuration in which one
processor with supporting majority is the global coordinator until the end of the
execution.

Proof. We distinguish the following cases:

Case 1: Only one processor with supporting majority exists. Assume 
there is only a single processor pe that has a supporting majority throughout R. 
According to Lemma 5.2, pe must eventually propose a view propVe, based on 
the current FDe reading (line 5) which becomes the propVe-set. By Lemma 5.1, 
any other processor without a supporting majority will eventually stop being 
the focal coordinator of any pj £ propVe-set and since such processors do not 
have a supporting majority, the first condition of line 9 will prevent them from 
proposing. 

Processor pe continuously proposes propVe until all processors in propVe-set 
have sent a replica showing that they have adopted propVe as their propV. 
Every processor that is alive throughout R and in FDe should receive this 
replica through the self-stabilizing reliable communication. The only condition 
that may prevent pj to adopt propVe is if for some p r € repj[i].propVe-set it 
holds that p e ^ repj[r\.FD (line 6). Plainly put, p,j believes that p r suspects pe- 

Case 1(a): Ifp^-’s information is correct about p r , thenp r fL P m aj{£)- Thus 
at some point pe will suspect p r and exclude p r FDe- 

Case 1(b): If p ^s information is false -remnant of some arbitrary state-, 
then pe £ FD r and since p r , by the last condition of line 22, sends rep r [r] 
infinitely often to pj , then repj[r ] will be corrected and Pj will accept propVe. 

Since $p_\ell$ has a majority $P_{maj}(\ell) \subseteq propV_\ell.set$, at least a
majority of processors receive $propV_\ell$ and eventually accept it. If some
processor $p_j \in propV_\ell.set$ does not adopt $p_\ell$'s proposal in R, it is
eventually removed from $FD_\ell$ and thus does not belong to the supporting
majority of $p_\ell$ (as detailed in Case 1(a) above). By the above, $p_\ell$ is
able to get at least the supporting majority $P_{maj}(\ell)$, if not all of the
members of $propV_\ell.set$, to accept its view. In the latter case it can proceed
to the installation of the view. If there is any change in the failure detector
of $p_\ell$ before it installs a view, $p_\ell$ can satisfy the second case of
line 9 to create a new, updated view. Note that in the meantime no processor
other than $p_\ell$ can satisfy the conditions of that line, and thus it is the
only processor that can propose and become the coordinator. Thus $p_\ell$
eventually becomes the coordinator if it is the single majority-supported
processor.
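To make the two cases of line 9 concrete, the following Python sketch encodes
the guard as it is described in this proof. The counting conditions are taken
from the text; how they are combined with the coordinator set (`val_crd`), as
well as the names `Rep` and `may_propose`, are assumptions made for
illustration only.

```python
from dataclasses import dataclass


@dataclass
class Rep:
    """Hypothetical replica fields that processor i stores about processor k."""
    fd: frozenset       # k's failure-detector reading
    no_crd: bool        # k reports having no coordinator
    propv: tuple        # k's adopted proposal as (view_id, member_set)


def may_propose(i, n, fd, propv, val_crd, rep):
    """Sketch of the line 9 guard for processor i (assumed composition)."""
    majority = n // 2
    if len(fd) <= majority:              # no trusted majority: never propose
        return False
    # First case: no unique coordinator is seen, and a majority trusts i
    # and reports having no coordinator.
    no_crd_support = sum(1 for k in fd if i in rep[k].fd and rep[k].no_crd)
    if len(val_crd) != 1 and no_crd_support > majority:
        return True
    # Second case: i is the coordinator, its detector reading changed, and a
    # majority still holds i's current proposal.
    same_propv = sum(1 for k in fd if rep[k].propv == propv)
    return val_crd == {i} and fd != propv[1] and same_propv > majority


n = 5
old = (7, frozenset({0, 1, 2}))            # (view_id, member_set)
rep = {k: Rep(frozenset({0, 1, 2}), True, old) for k in range(3)}
first_case = may_propose(0, n, frozenset({0, 1, 2}), old, set(), rep)
minority = may_propose(3, n, frozenset({3, 4}), old, set(), rep)

# Processor 0 now coordinates and starts trusting processor 3 as well.
rep[3] = Rep(frozenset({0, 1, 2, 3}), True, (0, frozenset()))
second_case = may_propose(0, n, frozenset({0, 1, 2, 3}), old, {0}, rep)
unchanged = may_propose(0, n, frozenset({0, 1, 2}), old, {0}, rep)
```

In this toy run processor 0 may propose in both cases, a processor that trusts
only two of five processors never may, and a coordinator whose failure detector
still equals its proposed set has no reason to propose again.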

Case 2: More than one processor with supporting majority. Consider
two processors $p_\ell, p_{\ell'}$ that have a supporting majority such that each
creates a view (line 9). By the correctness of our counter algorithm, inc()
returns two distinct and ordered counters to use as view identifiers. Without
loss of generality, we assume that $propV_\ell$, proposed by $p_\ell$, has the
greatest identifier of all the counters created by calls to inc(). We identify
the following four subcases:

Case 2(a): $p_\ell \in FD_{\ell'} \wedge p_{\ell'} \in FD_\ell$. In this case
$p_{\ell'}$ will propose its view $propV_{\ell'}$ and wait for all
$p_i \in propV_{\ell'}.set$ to adopt it (line 10). Whenever $p_\ell$ receives
$propV_{\ell'}$, it will store it but will not adopt it, since
$propV_{\ell'}.ID \prec_{ct} propV_\ell.ID$ (line 7). The proposal $propV_\ell$ is
also propagated to every $p_i \in propV_\ell.set$. Since there is no greater
proposed view identifier than $propV_\ell.ID$, this proposal is adopted by all
$p_i \in propV_\ell.set$, which includes $p_{\ell'}$. Thus any processor with
supporting majority that belonged to the proposed set of $p_\ell$ will propose at
most once, and $p_\ell$ will become the sole coordinator. Note that if
$p_{\ell'}$ is prevented from adopting $propV_\ell$ for some time, this is due to
reasons detailed and solved in Case 1 of the previous lemma. The case where the
failure detection reading changes for $p_\ell$ is also tackled as in Case 1 of
this lemma, by noticing that if $p_\ell$ manages to get a majority of the
processors of propV.set, then $p_\ell$ will change its proposed view without
losing this majority.

Case 2(b): $p_\ell \notin FD_{\ell'} \wedge p_{\ell'} \notin FD_\ell$. Since both
processors were able to propose, this implies that a majority of the processors
that belonged to each of $p_\ell$'s and $p_{\ell'}$'s supporting majority had
reported that they had no coordinator (line 9). Each of $p_\ell$ and $p_{\ell'}$
proposes its view to its propV.set and waits for acknowledgments from all the
processors in propV.set (line 10) in order to install the view. Since
$p_\ell \notin FD_{\ell'}$, $p_{\ell'}$ does not consider $propV_\ell$ a valid
proposal (line 6) and retains its own proposal, which it propagates. The same is
done by $p_\ell$. Since $p_\ell$ has the greatest view identifier, any
$p_i \in propV_\ell.set \cap propV_{\ell'}.set$ might initially adopt
$propV_{\ell'}$ but will eventually choose the greater $propV_\ell$. If
$p_{\ell'}$'s proposal was accepted by all members of $propV_{\ell'}.set$, then
$p_{\ell'}$ became the global coordinator, but it will then lose the
coordinatorship to $p_\ell$ because $propV_\ell$ has a greater view identifier.

More crucially, $p_{\ell'}$ cannot make another proposal, since it will not have
a majority of processors that do not have a coordinator. This is deduced from
the intersection property of the two majorities ($propV_\ell.set$ and
$propV_{\ell'}.set$). Since any processor $p_k$ in the intersection
$propV_\ell.set \cap propV_{\ell'}.set$ has $p_\ell$ as its coordinator,
$p_{\ell'}$ does not satisfy the condition
$|\{p_k \in FD_{\ell'} : p_{\ell'} \in rep[k].FD \wedge rep[k].noCrd\}| > \lfloor n/2 \rfloor$
of line 9, and thus cannot propose a new view. Processor $p_\ell$ will install
its view and remain the sole coordinator. Also, $p_\ell$ is the only one that can
change its view due to a failure detector change, since it manages to get a
majority of the processors in $propV_\ell.set$, as opposed to $p_{\ell'}$.
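The intersection property invoked here follows from a standard counting
argument. As a brief reminder, with $A$ and $B$ used as placeholder names for
two majority sets among $n$ processors:

```latex
|A \cap B| = |A| + |B| - |A \cup B|
           \geq 2\left(\lfloor n/2 \rfloor + 1\right) - n
           \geq 1,
\qquad \text{since } |A|, |B| > \lfloor n/2 \rfloor \text{ and } |A \cup B| \leq n .
```

Hence at least one processor belongs to both proposed sets.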

Case 2(c): $p_\ell \in FD_{\ell'} \wedge p_{\ell'} \notin FD_\ell$. Here we note
that since $p_\ell$ has the greatest counter but has not included $p_{\ell'}$ in
its $propV_\ell.set$, it should eventually be able to get all the processors in
$propV_\ell.set$ to follow $propV_\ell$, by the arguments of Case 2(a). In the
meantime $p_{\ell'}$ will, in vain, be waiting for a response from $p_\ell$
accepting $propV_{\ell'}$. We note that $p_{\ell'}$ will not be able to initiate
a new view once $propV_\ell$ is accepted, since it will not be able to gather a
majority of processors with either noCrd = True or proposed view $propV_{\ell'}$.

Case 2(d): $p_\ell \notin FD_{\ell'} \wedge p_{\ell'} \in FD_\ell$. This case is
not symmetric to the above, due to our assumption that $p_\ell$ is the one that
has drawn the greatest view identifier from inc(). Here $propV_\ell.set$ includes
$p_{\ell'}$, so $p_\ell$ waits for a response from $p_{\ell'}$ to proceed to the
installation of $propV_\ell$. On the other hand, $p_{\ell'}$ will be waiting for
responses from the processors in $propV_{\ell'}.set$. Any
$p_i \in propV_\ell.set \cap propV_{\ell'}.set$ cannot keep $propV_\ell$ (even if
it initially accepted it), since $propV_\ell$ does not satisfy the condition
$p_{\ell'} \in rep[\ell].propV.set \iff p_\ell \in rep[\ell'].FD$ of line 6. Thus
$p_i$ accepts $propV_{\ell'}$ instead of $propV_\ell$. Processor $p_\ell$ cannot
propose a different view, since it will not be able to get a majority of
processors that have $propV_\ell$.

By the above exhaustive examination of cases, we reach the result. Note
that the above proof guarantees both convergence and closure of the algorithm to
a legal execution, since $p_\ell$ remains the coordinator as long as it has a
supporting majority. ■

Theorem 5.4 Starting from an arbitrary configuration, any execution R of
Algorithm 4 satisfying Definition 5.1 simulates automaton replication preserving
the virtual synchrony property.

Proof. We consider a finite prefix R' of R which starts in an arbitrary
configuration c, and in which there exists a primary partition (as per
Definition 5.1). Assume that this prefix is sufficiently long for Lemma 5.3 to
hold, i.e., to reach a configuration $c_{safe}$ in which there exists a global
coordinator for a majority of processors. For this configuration we define a
view v that has a coordinator $p_\ell$ and in which any processor $p_i \in v$
that is not the coordinator is a follower of $p_\ell$. We define a multicast
round to be a sequence of ordered events: (i) fetch() input and propagate to
coordinator, (ii) coordinator disseminates messages to be delivered in this new
round, (iii) messages delivered and (iv) side effects produced by all
processors. Our proof is broken into three steps that map the three possible
transitions:

Step 1: Virtual synchrony is preserved between any two multicast 
rounds. 

Suppose that there exists an input and a related message m in round r that 
is not delivered within r. We follow the multicast round r. First observe the 
following. 

Remark: Within any multicast round, the coordinator executes lines 12 to 13 
only once and a follower executes lines 16 to 18 only once, because the conditions 
are only satisfied the first time that the coordinator’s local copy of the replica 
changes the round number. 

By our Remark we notice that fetch() is called only once per round to
collect input from the environment. This input cannot be changed or overwritten,
since followers can never reach $rep[i] \leftarrow rep[\ell]$ of line 17, which
is the only line modifying the input field, unless they receive a new round
number greater than the one they currently hold. We notice that the followers
have produced side effects for the previous round (using apply()) based on the
messages and state of the previous round. Similarly, the coordinator executes
fetch() exactly once, and only before it populates the msg array and after it
has produced the side effects for the environment that were based on the
previous messages (line 12). Line 13 populates the msg array with the messages,
including the coordinator's own input. The coordinator $p_\ell$ then
continuously propagates its current replica but, by the Remark, cannot change it
until the condition
$(\forall p_i \in v.set : rep_\ell[i].(view, status, rnd) = (view_\ell, status_\ell, rnd_\ell))$
(line 10) holds again. This ensures that the coordinator will change its msg
array only when every follower has executed line 17, which allows the
aforementioned condition to hold.

Any follower that keeps a previous round number does not allow the coordinator
to move to the next round. If the coordinator moves to a new round, it is
implied that $rep[i] \leftarrow rep[\ell]$ was executed and thus message m was
received by every follower $p_i$, by our assumptions that the replica is
propagated infinitely often and the data links are stabilizing. Thus, by the
assumptions, any message m is certainly delivered within the view and round it
was sent in, and thus the virtual synchrony property is preserved, whilst at the
same time common state replication is achieved.
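The round discipline argued in Step 1 can be traced with a toy Python
simulation. This is a deliberately simplified sketch and not Algorithm 4: the
`Node` class, the helper names and the `fetch` stub are hypothetical, replicas
are read directly instead of being exchanged over data links, and no failures
or view changes occur.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical replica: round number, pending input, delivered messages."""
    rnd: int = 0
    inp: str = ""
    msg: dict = field(default_factory=dict)
    delivered: list = field(default_factory=list)


def fetch(name, rnd):
    """Stub for the environment input collected in a given round."""
    return f"{name}-in{rnd}"


def coordinator_step(crd, followers):
    """Advance one round only if every follower echoes the current round."""
    if any(f.rnd != crd.rnd for f in followers.values()):
        return False                         # a lagging follower blocks progress
    if crd.msg:
        crd.delivered.append(dict(crd.msg))  # side effects of the finished round
    crd.inp = fetch("crd", crd.rnd)
    crd.msg = {name: f.inp for name, f in followers.items()}
    crd.msg["crd"] = crd.inp
    crd.rnd += 1
    return True


def follower_step(me, crd, name):
    """Adopt the coordinator's replica once, when its round number is greater."""
    if crd.rnd <= me.rnd:
        return False
    me.rnd, me.msg = crd.rnd, dict(crd.msg)
    me.delivered.append(dict(crd.msg))       # deliver, then collect new input
    me.inp = fetch(name, me.rnd)
    return True


crd = Node()
followers = {"a": Node(inp="a-in0"), "b": Node(inp="b-in0")}

started = coordinator_step(crd, followers)            # round 0 -> 1
blocked = coordinator_step(crd, followers)            # both followers lag
adopted = follower_step(followers["a"], crd, "a")
still_blocked = coordinator_step(crd, followers)      # b still lags
follower_step(followers["b"], crd, "b")
repeated = follower_step(followers["a"], crd, "a")    # no second adoption
advanced = coordinator_step(crd, followers)           # round 1 -> 2
```

In this run the coordinator cannot leave round 1 until both followers have
adopted its replica, each follower adopts at most once per round, and all three
nodes end up with the same delivered message set for the round.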

Step 2: Virtual synchrony is preserved in two consecutive view
installations where there is no change of coordinator.

We now turn to the case where from one configuration $c_{safe}$ we move to a new
configuration $c'_{safe}$ that has a different view v' but the same coordinator
$p_\ell$. Once $p_\ell$ is in an iteration where the condition
$FD \neq propV.set$ of line 9 holds, a view change is required. Since $p_\ell$ is
the global coordinator, no other processor can satisfy the condition
$(|\{p_k \in FD_\ell : rep_\ell[k].propV = propV_\ell\}| > \lfloor n/2 \rfloor)$
of line 9, and so only $p_\ell$ can. For more on why this holds one can refer to
Lemma 5.2. Processor $p_\ell$ creates a new propV with a new view ID taken from
the increment counter algorithm, which is greater than the previously
established view ID in v.ID. The last condition of line 10 guarantees that
$p_\ell$ will not execute lines 12 to 14, and thus will not change its
rep.(state, input, msg) fields, until all the expected followers of the proposed
view have sent their replicas. Followers that receive the proposal will accept
it, since none of the conditions that existed change, and so the new view
proposal enforces that $valCrd = \{p_\ell\}$. Moreover, the proposal satisfies
the condition of line 15 and the followers of the view enter status Propose,
leading to the installation of the view. What is important is that virtual
synchrony is preserved, since no follower changes rep.(state, input, msg) during
this procedure, and moreover each sends its replica to $p_\ell$ by line 22. Once
the replicas of all the followers have been collected, the coordinator creates a
consolidated state and msg array of all messages that were either delivered or
pending. $p_\ell$'s new replica is communicated to the followers, who adopt this
state as their own (line 19). Thus virtual synchrony is preserved, and once all
the processors have replicated the state of the coordinator, a new series of
multicast rounds can begin by producing the side effects required by the input
collected before the view change.

Step 3: Virtual synchrony is preserved in two consecutive view
installations where the coordinator changes.

We assume that $p_\ell$ had a supporting majority throughout R'. We define a
matching suffix R'' to prefix R', such that R'' results from the loss of
supporting majority by $p_\ell$. Notice that, since Definition 5.1 is required to
hold, some other processor with supporting majority, $p_{\ell'}$, will by Lemma
5.2 propose the view v' with the highest view ID. We note that by the
intersection property and the fact that a view set can only be formed by a
majority set, $\exists p_i \in v \cap v'$. Thus, the "knowledge" of the system,
(state, input, msg), is retained within the majority.

As detailed in Step 2, if a processor $p_i$ had noCrd = True for some time or
was in status Propose, it did not incur any changes to its replica. If it
entered the Install phase, then this implies that the proposing processor has
created a consolidated state that $p_\ell$ has replicated. What is noteworthy is
that whether in status Propose or Install, if the proposer collapses (becomes
inactive or suspected), the virtual synchrony property is preserved. It follows
that, once status Multicast is reached by all followers, the system can start a
practically infinite number of multicast rounds.

Thus, by the self-stabilization property of all the components of the system
(counter increment algorithm, the data links, the failure detector and
multicast), a legal execution is reached in which the virtual synchrony property
is guaranteed and common state replication is preserved. ■


6 Conclusion 

State-machine replication (SMR) is a service that simulates finite automata
by letting the participating processors periodically exchange messages about
their current state as well as the last input that has led to this shared state.
Thus, the processors can verify that they are in sync with each other. A
well-known way to emulate SMRs is to use reliable multicast algorithms that
guarantee virtual synchrony [4, 16]. In this respect, we have presented the
first self-stabilizing algorithm that guarantees virtual synchrony, and used it
to obtain a self-stabilizing SMR emulation; within this emulation, the system
progresses in more extreme asynchronous executions in contrast to
consensus-based SMRs, like the one in [9]. One of the key components of the
virtual synchrony algorithm is a novel self-stabilizing counter algorithm that
establishes an efficient practically unbounded counter, which in turn can be
directly used to implement a self-stabilizing MWMR register emulation; this
extends the work in [1] that implements SWMR registers and can also be
considered simpler and more communication efficient than the MWMR register
implementation presented in [9].